Flexible instruction processor systems and methods

ABSTRACT

A design system for generating configuration information and associated executable code based on a customisation specification, which includes application information including application source code and customisation information including design constraints, for implementing an instruction processor using re-programmable hardware, the system comprises a template generator for generating a template for each processor style identified as a candidate for implementation; an analyser for analysing instruction information for each template and determining instruction optimisations; a compiler for compiling the application source code to include the instruction optimisations and generate executable code; an instantiator for analysing architecture information for each template, determining architecture optimisations and generating configuration information including the architecture optimisations; and a builder for generating device-specific configuration information from the configuration information including the architecture optimisations. 
In another aspect, a management system for managing run-time re-configuration of an instruction processor implemented using re-programmable hardware, comprises a configuration library containing configuration information for a plurality of instruction processor implementations; a code library for containing associated executable code for the implementations; a loader for loading application data and, as required, configuration information and associated executable code into re-programmable hardware for implementation and execution of an instruction processor; a loader controller for signalling the loader to load application data and, as required, configuration information and associated executable code, and execute the executable code; a run-time monitor for obtaining run-time statistics relating to operation of the instruction processor; an optimisation determiner configured to receive the run-time statistics, and being operable to instruct the loader to load new configuration information and associated executable code for a new implementation into the re-programmable hardware; and an optimisation instructor for invoking the optimisation determiner.

RELATED APPLICATION DATA

This application is a continuation of U.S. patent application Ser. No. 10/416,977, filed Oct. 30, 2003, which is a national phase of International Patent Application No. PCT/GB01/05080 filed Nov. 19, 2001, which U.S. patent application issued on Jun. 2, 2009 as U.S. Pat. No. 7,543,283, all of which are hereby incorporated by reference in their entireties.

FIELD OF THE INVENTION

The present invention relates to the design-time and run-time environments of re-programmable instruction processors, such instruction processors being referred to in the specification as flexible instruction processors (FIPs).

In one aspect the present invention relates to a FIP design system for generating FIP configuration information and associated executable FIP code for FIP implementations based on user-specified customisation specifications, and a related method of generating FIP configuration information and associated executable FIP code for FIP implementations based on user-specified customisation specifications.

In another aspect the present invention relates to a FIP management system for managing the run-time adaptation of the FIP configuration information and associated executable FIP code of a FIP implementation, and a related method of managing the adaptation of the FIP configuration information and associated executable FIP code of a FIP implementation during run-time.

BACKGROUND

General-purpose instruction processors, such as those from AMD Corporation (US) and Intel Corporation (US), have dominated computing for a long time. However, such processors have fixed architectures, and tend to lose performance when dealing with non-standard operations and non-standard data which are not supported by the instruction set formats [1].

The need for customising instruction processors for specific applications is particularly acute in embedded systems, such as cell phones, medical appliances, digital cameras and printers [2].

It is possible to develop customised integrated circuits for executing programs written in a specific language. An example is the GMJ30501SB processor which is customised for executing the Java language (Helborn Electronics, Segyung, Korea). However, the design and fabrication of such integrated circuits is still expensive, and, once designed, their customised function is fixed and cannot be altered.

Re-programmable hardware, such as Field-Programmable Gate Arrays (FPGAs) from Xilinx Inc. (San Jose, Calif., US) or Complex Programmable Logic Devices (CPLDs) from Altera Corporation (San Jose, Calif., US), provides a means of implementing instruction processors using standard, off-the-shelf components. The use of such devices not only eliminates the risks associated with integrated circuit design and fabrication, but also opens up the possibilities of having a customisable processor.

One route to supporting customisation is to augment an instruction processor with programmable logic for implementing custom instructions. Several vendors are offering a route to such implementations [3-5]. The processors involved are usually based on existing architectures, such as those from ARM, IBM and MIPS. These fixed instruction processor cores are interfaced to programmable logic, which provides the resources for implementing a set of custom instructions for a given application. One such implementation utilises one or more FPGAs as programmable execution units in combination with a defined execution unit [6]. This system includes a data processor which comprises a defined execution unit which is coupled to internal buses of the processor for execution of a pre-defined set of instructions, combined with one or more programmable execution units which are coupled to the internal buses for execution of programmed instructions. This approach does not, however, encompass customising the overall architecture of the defined execution unit and programmable execution units, or the set of tools supporting such customisation.

Another route to supporting customisation of instruction processors is to implement instruction processors using existing FPGAs [7]. With such an implementation, it is possible to customise the entire instruction processor at compile time [8] or at run time [9, 10]. An automated method for instruction processor design and optimisation based on capturing the instruction interpretation process as a parallel program has been developed [11], and a number of instruction processors have been implemented [12-14], although the performance of these processors has not been reported.

A further prior art approach involves methods and tools for the automatic generation of configurable processors at design time [15]. This approach does not, however, encompass methods and tools for the automatic generation of processors customisable both at design and run time.

Automatic methods for producing compile-time and run-time customisable datapaths have also been developed [16], but the instruction set architectures (ISAs) in these designs are fixed, as implemented on a commercial microprocessor, for example, and the architecture is not customisable.

SUMMARY OF THE INVENTION

It is an aim of the present invention to provide design-time and run-time environments for flexible instruction processors (FIPs) which provide for user customisation and design-time optimisation, in order to exploit the re-programmability of current and future generations of re-programmable hardware.

Accordingly, the present invention provides a system for and method of automatically generating a customisable processor and the executable processor code. Both the code and the processor can be customised both at design time and run time according to a user-provided customisation specification.

FIPs advantageously provide a means of creating customised processors which can be tuned for specific applications. FIPs are assembled from a skeletal processor template, which comprises modules which are interconnected by communication channels, and a set of parameters. The template can be used to produce different processor implementations, such as different processor styles, for example, stack-based or register-based styles, by varying the parameters for that template, and by combining and optimising existing templates. The parameters for a template are selected to transform a skeletal processor into a processor suited to a particular application. When a FIP is assembled, required instructions are retrieved from a library that contains implementations of these instructions in various styles. Depending on which instructions are included, resources such as stacks and different decode units are instantiated, with the communication channels providing a mechanism for dependencies between instructions and resources to be mitigated.
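The assembly of a FIP from a template, a set of parameters and an instruction library can be illustrated with a minimal sketch. It is an illustration under stated assumptions only: the class `FipTemplate`, the table `INSTRUCTION_LIBRARY`, the resource names and the `data_width` parameter are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass, field

# Hypothetical instruction library: for each (style, instruction) pair it
# lists the resources that instruction needs. All names are illustrative.
INSTRUCTION_LIBRARY = {
    ("stack", "iadd"): {"operand_stack", "alu"},
    ("stack", "iload"): {"operand_stack", "local_vars"},
    ("stack", "invoke"): {"operand_stack", "call_stack", "complex_decode"},
    ("register", "add"): {"register_file", "alu"},
}


@dataclass
class FipTemplate:
    """Skeletal processor: a style plus tunable parameters."""
    style: str
    parameters: dict = field(default_factory=dict)

    def assemble(self, required_instructions):
        """Return a description of the resources to instantiate."""
        # Assumption: fetch and a basic decode unit are always present.
        resources = {"fetch", "basic_decode"}
        for name in required_instructions:
            resources |= INSTRUCTION_LIBRARY[(self.style, name)]
        return {
            "style": self.style,
            "parameters": dict(self.parameters),
            "instructions": sorted(required_instructions),
            "resources": sorted(resources),
        }


# Only the resources needed by the included instructions are instantiated.
fip = FipTemplate("stack", {"data_width": 32}).assemble(["iadd", "iload"])
```

In this sketch the call stack and the complex decode unit are left out because the `invoke` instruction is not included, which mirrors the statement that resources are instantiated depending on which instructions are included.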

As compared to a direct hardware implementation, FIPs have the added overheads of instruction fetch and decode. However, FIPs have many advantages.

FIPs allow customised hardware to be accommodated as new instructions. This combines the efficient and structured control path associated with an instruction processor with the benefits of hand-crafted hardware. The processor and its associated opcodes provide a means of optimising control paths through optimising the compilers.

Critical resources can be increased as demanded by the application domain, and eliminated if not used. Instruction processors provide a structure for these resources to be shared efficiently, and the degree of sharing can be determined at run time.

FIPs enable high-level data structures to be easily supported in hardware and also help preserve current software investments and facilitate the prototyping of novel architectures, such as abstract machines for real arithmetic and declarative programming [13].

In particular, the FIP approach of the present invention enables different implementations of a given instruction set with different design trade-offs. These implementations can also be related by transformation techniques [11], which provide a means of verifying non-obvious, but efficient, implementations.

The efficiency of an implementation is often highly dependent on the style of the processor selected. Specialised processor styles, such as the Three Instruction Machine (TIM) [13], are designed specifically to execute a specific language. Even processor templates which are designed for more general application, such as the stack-based Java Virtual Machine (JVM) or register-based MIPS, are more efficient for different tasks. Hence, for a given application, the selection of the processor style is an important decision. Issues such as the availability of resources, the size of the device and the speed requirements are affected by the decision.

Different styles of processors support different customisable instruction formats. These processors also have different trade-offs in size, speed and ease of hardware re-programmability. For example, register-style implementations of JVM are fast but large, while stack-style implementations of JVM are slower and smaller. A processor library containing information for producing different styles of processors is used in the generation of customised processors at design and run time.

It is also possible to generate new processors and the corresponding code by combining different styles of processors. This allows the systematic development of complex processors and the corresponding codes by combining simpler processors.

Related tools including compilers, assemblers, linkers, disassemblers, debuggers, instruction set simulators and other facilities are provided for optimising performance, reducing size, reducing power consumption, etc. This optimisation can be achieved by, for instance, reducing the frequency of re-configuration or reducing the wire congestion in the programmable hardware.

The present invention supports optimisation by run-time customisation. Run-time customisation includes: (a) reducing resource usage by re-programming the hardware so that only the essential elements are present in the hardware at a particular time; (b) optimising performance and usage by adapting the programmable hardware to run-time conditions not known at design time; and (c) optimising performance and usage by conditionally downloading new code and/or new hardware at run time from an external source, such as the internet.

For a given customisation specification and processor library, an embodiment of the present invention provides means for generating: (a) a plurality of hardware descriptions of a customisable processor, each representing a possible customised version of the processor tuned to specific run-time conditions; (b) customised tools for developing and optimising code executable on the hardware descriptions, and information to allow optimised combination of such code at run time; and (c) hardware and software mechanisms for selecting hardware and code to run at a particular instant at run time. The selection can be decided by the user at design time, or can be influenced by run-time conditions.

As compared to direct hardware implementation, FIPs have the added overhead of instruction fetch and execute. VLIW and EPIC architectures are attempts to reduce the ratio of the number of fetches to the number of executions. Customising instructions is also a technique to reduce the fetch-to-execute ratio and increase the performance of the FIP. The concept of incorporating custom instructions in a FIP has been reported [4, 10]. Custom instructions are typically hand crafted and incorporated during FIP instancing. While hand-crafted custom instructions provide for the best performance, these instructions are difficult to create, and require a skilled engineer with good knowledge of the system. The present invention provides a technique which automatically creates custom instructions through opcode chaining and other optimisations in order to improve performance. This technique can be used both at compile time and at run time.
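One plausible reading of opcode chaining can be sketched as follows: find the most frequent pair of adjacent opcodes in an instruction trace and replace each occurrence with a single fused custom opcode, so that fewer fetches are needed for the same work. This is a sketch under assumptions, not the procedure of the specification: the helper names, the restriction to pairs and the example trace are all hypothetical.

```python
from collections import Counter


def most_frequent_pair(trace):
    """Return the most common adjacent opcode pair in a trace, or None."""
    pairs = Counter(zip(trace, trace[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def chain_opcodes(trace, pair, fused_name):
    """Replace each non-overlapping occurrence of `pair` with one opcode."""
    out, i = [], 0
    while i < len(trace):
        if i + 1 < len(trace) and (trace[i], trace[i + 1]) == pair:
            out.append(fused_name)  # one fetch now covers two operations
            i += 2
        else:
            out.append(trace[i])
            i += 1
    return out


# Hypothetical trace of stack-style opcodes.
trace = ["iload", "iadd", "istore", "iload", "iadd", "iload", "iadd"]
pair = most_frequent_pair(trace)
chained = chain_opcodes(trace, pair, "+".join(pair))
```

For this trace the pair (`iload`, `iadd`) occurs three times, so chaining reduces the number of fetched instructions from seven to four. A practical tool would also have to check that a fused instruction fits the available hardware and preserves program semantics, which this sketch does not attempt.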

In a preferred embodiment, at run time and for given application data, the hardware and/or software mechanisms can adopt one or both of a different customised processor or a different piece of code containing instructions to deal with application data, depending on the user and/or run-time conditions. The selection can take into account the speed, size, and power consumption of the current and the re-programmed implementations, and also the re-programming time. The compiled code can contain information to enable the creation or retrieval of the corresponding customised processor. If a piece of compiled code is encountered where the customised processor for its execution does not exist, such information will enable, for instance, the processor to be retrieved by loading the same from a network source.
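The trade-off between a faster re-programmed implementation and the re-programming time can be made concrete with a small sketch. It models speed and re-programming time only; size and power consumption, which the selection can also take into account, are omitted. The `Implementation` record, the `choose` function and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Implementation:
    name: str
    time_per_item: float  # estimated execution time per data item (s)
    reconfig_time: float  # time to load this configuration (s)


def choose(current, candidates, items_remaining):
    """Pick the implementation with the lowest estimated total time.

    Staying on `current` costs no re-programming time; switching to a
    candidate pays that candidate's reconfig_time once.
    """
    best, best_cost = current, current.time_per_item * items_remaining
    for cand in candidates:
        cost = cand.reconfig_time + cand.time_per_item * items_remaining
        if cost < best_cost:
            best, best_cost = cand, cost
    return best


generic = Implementation("generic", time_per_item=1e-3, reconfig_time=0.0)
custom = Implementation("custom", time_per_item=4e-4, reconfig_time=0.3)

few = choose(generic, [custom], items_remaining=100)    # stays on generic
many = choose(generic, [custom], items_remaining=1000)  # switches to custom
```

With these assumed figures the break-even point is 500 items: below it the re-programming time outweighs the per-item saving, above it the customised implementation wins.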

The present invention also provides for the run-time adaptation of FIPs. The run-time adaptability of a FIP system allows the system to evolve to suit the requirements of the user, typically by performing automatic refinement based on instruction usage patterns.

The techniques and tools that have been developed include: (a) a run-time environment that manages the re-configuration of a FIP so as to execute applications as efficiently as possible; (b) mechanisms for accumulating run-time metrics and analysing the metrics to allow the run-time environment to request automatic refinements; and (c) customisation techniques for automatically customising a FIP to an application.

The run-time approach of the present invention is one of a modular framework which is based on FIP templates which capture various instruction processor styles, such as stack-based or register-based styles, enhancements which improve functionality and performance, such as processor templates which provide for superscalar and hybrid operation, compilation strategies involving standard compilers and FIP-specific compilers, and technology-independent and technology-specific optimisations, such as techniques for efficient resource sharing in FPGA implementations.

Predicting the run-time characteristics of a FIP system over a period of time is extremely difficult. For instance, the Advanced Encryption Standard (AES) [17] utilises a range of block sizes for different situations, and thus the processing required is situation dependent. Typically, where a long re-configuration time is undesirable, one can use a generic FIP that supports different AES block sizes moderately efficiently without requiring run-time re-configuration. Otherwise, one can use different FIP implementations which are customised to run different AES modes highly efficiently, and re-configure the FIP as required.

In one embodiment the FIP system can be automatically re-configured based on instruction usage patterns. For instance, different web sites deploy different encryption methods, and frequent visitors to particular web sites may use FIPs optimised for specific operations.

The present invention provides a FIP run-time adaptation system which provides: (a) a run-time environment which manages the re-configuration of a FIP so as to execute a given application as efficiently as possible; (b) mechanisms for accumulating run-time metrics and analysing the metrics to allow the run-time environment to request automatic refinements; and (c) a customisation system for automatically customising a FIP to an application.

As set out hereinabove, FIPs provide a well-defined control structure which facilitates varying the degree of sharing for system resources. This allows critical resources to be increased as demanded by the application domain, and eliminated if not used. FIPs also provide a systematic method for supporting customisation by allowing user-designed hardware to be accommodated as new instructions. These design-time optimisations provide a means of tailoring an instruction processor to a particular application or to applications for a specific domain, such as image processing.

Run-time adaptation allows for the further fine tuning of FIPs to run-time changes by exploiting the upgradability of re-programmable hardware, such as FPGAs. The present invention provides a FIP framework which simplifies re-configurability by providing a means of refining FIPs at both compile time and run time.

The ability to adapt a FIP system to the changing behaviour of applications is a powerful feature, but there are significant technical requirements to providing a working system. These requirements include: (a) the ability to create a plurality of FIP designs at compile time or run time; (b) managing the library of FIPs; and (c) ensuring that performance of the system is not diminished by providing added flexibility.

In this regard, an approach is developed which encompasses the following components: (i) a design tool for facilitating the creation of customised FIPs at compile time; (ii) a scheme for tracking available FIP designs and machine code; (iii) a run-time system for managing the FIP state and configuration; (iv) a metric used to decide if re-configuration is a suitable option at a given time; (v) a specification defining which run-time statistics are required for refinement analysis; (vi) a monitor for accumulating run-time statistics; and (vii) a tool for automatically customising a FIP based on the accumulated run-time statistics. Note that optimisation analysis and automatic refinement steps are optional and are included when their overheads can be tolerated.

In one aspect the present invention provides a design system for generating configuration information and associated executable code based on a customisation specification, which includes application information including application source code and customisation information including design constraints, for implementing an instruction processor using re-programmable hardware, the system comprising: a template generator for generating a template for each processor style identified as a candidate for implementation; an analyser for analysing instruction information for each template and determining instruction optimisations; a compiler for compiling the application source code to include the instruction optimisations and generate executable code; an instantiator for analysing architecture information for each template, determining architecture optimisations and generating configuration information, preferably domain-specific configuration information, including the architecture optimisations in order to instantiate the parameters for each template; and a builder for generating device-specific configuration information from the configuration information including the architecture optimisations.

Preferably, the system further comprises: a selector for profiling the configuration information and associated code for each candidate implementation, and selecting one or more optimal implementations based on predeterminable criteria.

Preferably, the application information further includes application data.

More preferably, the application data includes data representative of data to be operated on by the instruction processor.

Still more preferably, the application data includes data representative of a range of run-time conditions.

Preferably, the customisation information further includes at least one custom instruction.

More preferably, each custom instruction can be defined as mandatory or optional.

Preferably, the customisation information further identifies at least one processor style as a candidate for implementation.

Preferably, the system further comprises: a profiler for profiling information in the customisation specification and identifying at least one processor style as a candidate for implementation.

More preferably, the profiled information includes the application source code.

Preferably, the profiler is configured to identify a plurality of processor styles as candidates for implementation.

In one embodiment ones of the processor styles are identified to execute parts of an application, whereby the application is to be executed by combined ones of the processor styles.

Preferably, the profiler is further configured to collect profiling information for enabling optimisation.

Preferably, the profiling information includes frequency of groups of opcodes.

Preferably, the profiling information includes information regarding operation sharing.

Preferably, the profiling information includes information regarding operation parallelisation.

In one embodiment the analyser is configured to utilise the profiling information in analysing the instruction information, and determine the instruction optimisations therefrom.

Preferably, the instruction optimisations include operation optimisations.

More preferably, the operation optimisations include operation sharing optimisations.

More preferably, the operation optimisations include operation parallelisation optimisations.

Preferably, the instruction optimisations include custom instructions.

In one embodiment the analyser is configured to identify candidate instruction optimisations, and determine implementation of the instruction optimisations based on estimations performed by the instantiator.

Preferably, where the estimations from the instantiator provide that the re-programmable hardware cannot be programmed to implement all instructions together during run time, the analyser groups combined ones of instructions into sets of instructions which can be implemented by re-programming of the re-programmable hardware.
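As an illustration of grouping instructions into sets that each fit the re-programmable hardware, the sketch below applies a first-fit-decreasing heuristic to estimated areas. The heuristic is a stand-in chosen for simplicity, not the grouping method of the specification; the function name, the area figures and the capacity are hypothetical. A practical analyser would presumably also prefer to group instructions that are used together, in order to reduce the frequency of re-configuration.

```python
def group_instructions(areas, capacity):
    """First-fit-decreasing grouping of instructions into configurations.

    areas: mapping of instruction name -> estimated hardware area
    capacity: area available for instructions in one configuration
    """
    groups = []  # each entry: [remaining_area, [instruction names]]
    for name, area in sorted(areas.items(), key=lambda kv: -kv[1]):
        if area > capacity:
            raise ValueError(f"{name} does not fit on the device")
        for group in groups:
            if group[0] >= area:      # fits in an existing configuration
                group[0] -= area
                group[1].append(name)
                break
        else:                         # no fit: open a new configuration
            groups.append([capacity - area, [name]])
    return [names for _, names in groups]


# Hypothetical area estimates, in arbitrary units, for a device of size 100.
sets = group_instructions({"mul": 60, "div": 70, "add": 20, "shift": 30}, 100)
```

Here the four instructions need 180 units in total and so cannot be implemented together; the sketch splits them into two sets, each of which fits within one configuration.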

In one embodiment the analyser is configured to determine a plurality of implementations for different run-time conditions, each having instruction optimisations associated with the run-time conditions, and generate decision condition information associated with each implementation, which decision condition information enables selection between the implementations depending on actual run-time conditions.
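Decision condition information could, for example, take the form of a table that pairs each implementation with a predicate over run-time conditions. The sketch below is one hypothetical encoding; the table layout, the `block_bits` condition and the implementation names are assumptions made for illustration and are not defined in the specification.

```python
# Hypothetical decision-condition table produced at design time: each entry
# pairs a predicate over run-time conditions with an implementation name.
DECISION_TABLE = [
    (lambda c: c["block_bits"] == 128, "fip_aes128"),
    (lambda c: c["block_bits"] == 256, "fip_aes256"),
    (lambda c: True, "fip_generic"),  # fallback for all other conditions
]


def select_implementation(conditions, table=DECISION_TABLE):
    """Return the first implementation whose decision condition holds."""
    for predicate, implementation in table:
        if predicate(conditions):
            return implementation
    raise LookupError("no implementation matches the run-time conditions")


chosen = select_implementation({"block_bits": 128})
fallback = select_implementation({"block_bits": 192})
```

Ordering the table from most specific to most general lets a generic implementation serve as the default when no customised implementation matches the actual run-time conditions.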

Preferably, where the instruction optimisations cannot provide an implementation which complies with design constraints, the analyser is configured to invoke the profiler to re-profile the customisation specification based on analysis information provided by the analyser.

Preferably, the architecture optimisations performed by the instantiator include pipelining.

Preferably, the architecture optimisations performed by the instantiator include resource replication.

Preferably, the architecture optimisations performed by the instantiator include technology independent optimisations.

Preferably, the technology independent optimisations include removal of unused resources.

Preferably, the technology independent optimisations include opcode assignment.

Preferably, the technology independent optimisations include channel communication optimisations.

Preferably, the technology independent optimisations include customisation of data and instruction paths.

In one embodiment, where a plurality of configurations of the re-programmable hardware are required to implement the instruction processor, the instantiator is configured to optimise ones of the configurations into groups and schedule implementation of the grouped configurations.

Preferably, the system further comprises: a library containing processor definitions and associated parameters for a plurality of processor styles; and wherein the template generator is configured to generate templates from processor definitions and associated parameters extracted from the library.

Preferably, the processor styles include superscalar processors.

Preferably, the processor styles include hybrid processors.

In one embodiment the compiler is generated by the analyser, and the application source code is annotated with customisation information for compilation by the compiler to provide an optimised executable code.

In another embodiment the compiler is configured to compile the application source code and re-organise the compiled source code to incorporate optimisations to provide an optimised executable code.

Preferably, the configuration information and associated executable code, and, where relevant, the decision condition information, are deployed in at least one management system which is for managing adaptation and configuration of instruction processors implemented using re-programmable hardware.

Preferably, the configuration information and associated executable code, and, where relevant, the decision condition information, are deployed in at least one library for enabling re-programming of re-programmable hardware.

In one embodiment the re-programmable hardware comprises at least one field programmable gate array.

In another embodiment the re-programmable hardware comprises at least one complex programmable logic device.

Preferably, the instruction processor is fully implemented using the re-programmable hardware.

In another aspect the present invention provides a method of generating configuration information and associated executable code based on a customisation specification, which includes application information including application source code and customisation information including design constraints, for implementing an instruction processor using re-programmable hardware, the method comprising the steps of: generating a template for each processor style identified as a candidate for implementation; analysing instruction information for each template and determining instruction optimisations; compiling the application source code to include the instruction optimisations and generate executable code; analysing architecture information for each template and determining architecture optimisations; generating configuration information, preferably domain-specific configuration information, including the architecture optimisations in order to instantiate the parameters for each template; and generating device-specific configuration information from the configuration information including the architecture optimisations.

Preferably, the method further comprises the steps of: profiling the configuration information and associated code for each candidate implementation; and in response thereto, selecting one or more optimal implementations based on predeterminable criteria.

Preferably, the application information further includes application data.

More preferably, the application data includes data representative of data to be operated on by the instruction processor.

Still more preferably, the application data includes data representative of a range of run-time conditions.

Preferably, the customisation information further includes at least one custom instruction.

More preferably, each custom instruction can be defined as mandatory or optional.

Preferably, the customisation information further identifies at least one processor style as a candidate for implementation.

Preferably, the method further comprises the steps of: profiling information in the customisation specification; and identifying at least one processor style as a candidate for implementation.

More preferably, the profiled information includes the application source code.

Preferably, a plurality of processor styles are identified as candidates for implementation in the customisation specification profiling step.

In one embodiment ones of the processor styles are identified to execute parts of an application, whereby the application is to be executed by combined ones of the processor styles.

Preferably, profiling information for enabling optimisation is collected in the customisation specification profiling step.

Preferably, the profiling information includes frequency of groups of opcodes.

Preferably, the profiling information includes information regarding operation sharing.

Preferably, the profiling information includes information regarding operation parallelisation.

In one embodiment the instruction information analysis step comprises the steps of: utilising the profiling information in analysing the instruction information; and determining the instruction optimisations therefrom.

Preferably, the instruction optimisations include operation optimisations.

More preferably, the operation optimisations include operation sharing optimisations.

More preferably, the operation optimisations include operation parallelisation optimisations.

Preferably, the instruction optimisations include custom instructions.

In one embodiment the instruction information analysis step comprises the steps of: identifying candidate instruction optimisations; and determining implementation of the instruction optimisations based on estimations performed based on instantiation of the candidate instruction optimisations.

Preferably, where the estimations provide that the re-programmable hardware cannot be programmed to implement all instructions together during run time, the instruction information analysis step comprises the step of: grouping combined ones of instructions into sets of instructions which can be implemented by re-programming of the re-programmable hardware.

In one embodiment the instruction information analysis step comprises the steps of: determining a plurality of implementations for different run-time conditions, each having instruction optimisations associated with the run-time conditions; and generating decision condition information associated with each implementation, which decision condition information enables selection between the implementations depending on actual run-time conditions.

In one embodiment, where the instruction optimisations cannot provide an implementation which complies with design constraints, the instruction information analysis step comprises the step of: invoking the customisation specification profiling step to re-profile the customisation specification based on analysis information provided by the instruction information analysis step.

Preferably, the architecture optimisations include pipelining.

Preferably, the architecture optimisations include resource replication.

Preferably, the architecture optimisations include technology independent optimisations.

More preferably, the technology independent optimisations include removal of unused resources.

More preferably, the technology independent optimisations include opcode assignment.

More preferably, the technology independent optimisations include channel communication optimisations.

More preferably, the technology independent optimisations include customisation of data and instruction paths.

In one embodiment, where a plurality of configurations of the re-programmable hardware are required to implement the instruction processor, the instantiation step comprises the steps of: optimising ones of the configurations into groups; and scheduling implementation of the grouped configurations.

Preferably, each template is generated from processor definitions and associated parameters extracted from a library containing processor definitions and associated parameters for a plurality of processor styles.

Preferably, the processor styles include superscalar processors.

Preferably, the processor styles include hybrid processors.

In one embodiment the compiler utilised in compiling the application source code is generated in the instruction information analysis step, and the compiling step comprises the steps of: annotating the application source code with customisation information; and compiling the annotated source code to provide an optimised executable code.

In another embodiment the compiling step comprises the steps of: compiling the application source code; and re-organising the compiled source code to incorporate optimisations to provide an optimised executable code.

Preferably, the method further comprises the step of: deploying the configuration information and associated executable code, and, where relevant, the decision condition information, in at least one management system which is for managing adaptation and configuration of instruction processors implemented using re-programmable hardware.

Preferably, the method further comprises the step of: deploying the configuration information and associated executable code, and, where relevant, the decision condition information, in at least one library for enabling re-programming of re-programmable hardware.

In one embodiment the re-programmable hardware comprises at least one field programmable gate array.

In another embodiment the re-programmable hardware comprises at least one complex programmable logic device.

Preferably, the instruction processor is fully implemented using the re-programmable hardware.

In a further aspect the present invention provides a management system for managing run-time re-configuration of an instruction processor implemented using re-programmable hardware, comprising: a configuration library containing configuration information for a plurality of instruction processor implementations; a code library for containing associated executable code for the implementations; a loader for loading application data and, as required, configuration information and associated executable code into re-programmable hardware for implementation and execution of an instruction processor; a loader controller for signalling the loader to load application data and, as required, configuration information and associated executable code, and execute the executable code; a run-time monitor for obtaining run-time statistics relating to operation of the instruction processor during execution; an optimisation determiner configured to receive the run-time statistics, and being operable to instruct the loader to load new configuration information and associated executable code for a new implementation into the re-programmable hardware; and an optimisation instructor for invoking the optimisation determiner.

Preferably, the system comprises: a run-time manager including the loader controller, the run-time monitor and the optimisation instructor.

In one embodiment the optimisation instructor is configured automatically to invoke the optimisation determiner on a predeterminable event.

Preferably, the event is an instruction in the executable code.

In one embodiment the optimisation instructor is configured to be invoked by an external agent.

Preferably, the optimisation instructor is configured to be invoked in response to an actuation instruction from an external agent.

Preferably, the optimisation determiner is configured to instruct the loader controller to signal the loader to load the new configuration information and associated executable code into the re-programmable hardware on invocation of the optimisation instructor by the external agent.

Preferably, the actuation instruction identifies the implementation to be implemented using the re-programmable hardware.

Preferably, the configuration information and associated executable code for a new implementation are loaded into the respective ones of the configuration library and the code library prior to invocation of the optimisation instructor by an external agent, such that the configuration information and associated executable code for that implementation can be loaded into the re-programmable hardware on invocation of the optimisation instructor by the external agent.

Preferably, the system further comprises: a decision condition library for containing associated decision condition information for at least ones of the implementations; and wherein the loader is configured to provide the optimisation determiner with decision condition information for a plurality of other implementations for various run-time conditions of the implementation loaded in the re-programmable hardware, and the optimisation determiner is configured to profile the decision condition information for the other implementations, determine whether the decision condition information for any of the other implementations more closely fits the run-time statistics, and, where the decision condition information for one of the other implementations more closely fits the run-time statistics, instruct the loader controller to signal the loader to load the configuration information and associated executable code for that implementation into the re-programmable hardware.
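By way of illustration only, the selection between implementations on the basis of decision condition information might be sketched as follows. The representation of decision conditions as named numeric values, the use of a sum of squared differences as the measure of "fit", and all names in the sketch are assumptions made for this example; the specification does not prescribe them.

```python
from typing import Dict

# Hypothetical representation: each implementation's decision condition
# information is a mapping from a run-time statistic name to the value
# for which that implementation was optimised.
DecisionConditions = Dict[str, float]


def fit_distance(conditions: DecisionConditions, stats: Dict[str, float]) -> float:
    """Smaller is a closer fit. Sum of squared differences is an assumed measure."""
    return sum((stats[name] - value) ** 2 for name, value in conditions.items())


def choose_implementation(
    current: str,
    library: Dict[str, DecisionConditions],
    stats: Dict[str, float],
) -> str:
    """Return the implementation whose decision conditions fit the statistics
    most closely; the currently loaded one is kept unless another fits better."""
    best_name = current
    best_distance = fit_distance(library[current], stats)
    for name, conditions in library.items():
        distance = fit_distance(conditions, stats)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name


# Made-up decision condition library and run-time statistics.
library = {
    "fip_a": {"mul_fraction": 0.1, "branch_fraction": 0.3},
    "fip_b": {"mul_fraction": 0.6, "branch_fraction": 0.1},
}
observed = {"mul_fraction": 0.55, "branch_fraction": 0.12}
selected = choose_implementation("fip_a", library, observed)
```

In this sketch a result other than the currently loaded implementation would correspond to the optimisation determiner instructing the loader controller to signal a re-load.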

Preferably, the system further comprises: an adapter for generating one or more new implementations optimised to the run-time statistics; and wherein the optimisation determiner is configured to instruct the adapter to generate the one or more new implementations.

In one embodiment the adapter is configured to load the configuration information and associated executable code for each new implementation into respective ones of the configuration library and the code library.

In another embodiment the adapter is configured to load the configuration information, associated executable code and associated decision condition information for each new implementation into respective ones of the configuration library, the code library and the decision condition library.

Preferably, the optimisation determiner is configured to instruct the loader to load the configuration information and associated executable code for a new implementation into the re-programmable hardware on satisfaction of predeterminable criteria.

More preferably, the optimisation determiner is configured to instruct the loader to load the configuration information and associated executable code for a new implementation into the re-programmable hardware where a re-configuration ratio R>1, the re-configuration ratio R being given by the function:

$R = \frac{T_{sw}{\sum\limits_{j = 1}^{n}{C_{{sw},j}F_{j}}}}{{T_{ci}{\sum\limits_{j = 1}^{n}\left( {C_{{ci},j}F_{j}} \right)}} + T_{r}}$

Where:

-   C_(sw,j) is the number of clock cycles to implement a software function ƒ( );
-   T_(sw) is the cycle time for each clock cycle in the clock cycle number C_(sw,j);
-   C_(ci,j) is the number of clock cycles to implement a custom instruction;
-   T_(ci) is the cycle time for each clock cycle in the clock cycle number C_(ci,j); and
-   T_(r) is the time required for re-configuration of the re-programmable hardware.
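By way of illustration only, the re-configuration ratio R might be evaluated as in the following sketch. The class and function names are hypothetical, and F_(j) is assumed here to be the number of times the j-th function is called, since it is not defined in the list above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FunctionProfile:
    """Per-function figures; the field names are illustrative only."""
    c_sw: int  # C_sw,j: clock cycles for the software function
    c_ci: int  # C_ci,j: clock cycles for the custom instruction
    f: int     # F_j: assumed to be the number of calls


def reconfiguration_ratio(
    profiles: Sequence[FunctionProfile], t_sw: float, t_ci: float, t_r: float
) -> float:
    """R = (T_sw * sum(C_sw,j * F_j)) / (T_ci * sum(C_ci,j * F_j) + T_r)."""
    software_time = t_sw * sum(p.c_sw * p.f for p in profiles)
    custom_time = t_ci * sum(p.c_ci * p.f for p in profiles) + t_r
    return software_time / custom_time


def should_reconfigure(
    profiles: Sequence[FunctionProfile], t_sw: float, t_ci: float, t_r: float
) -> bool:
    """Re-configure only where R > 1."""
    return reconfiguration_ratio(profiles, t_sw, t_ci, t_r) > 1.0


# Made-up figures, with all times in nanoseconds.
hot = [FunctionProfile(c_sw=100, c_ci=10, f=1000)]
cold = [FunctionProfile(c_sw=100, c_ci=10, f=100)]
r_hot = reconfiguration_ratio(hot, t_sw=10, t_ci=20, t_r=500_000)
r_cold = reconfiguration_ratio(cold, t_sw=10, t_ci=20, t_r=500_000)
```

With these made-up figures the frequently called function yields R above 1 and the rarely called one yields R below 1, because the re-configuration time T_(r) is a fixed cost in the denominator.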

In one embodiment the adapter is configured to operate on line.

In another embodiment the adapter is configured to operate off line.

Preferably, the adapter comprises: an analyser for analysing instruction information based on the run-time statistics and determining instruction optimisations; a compiler for compiling the application source code to include the instruction optimisations and generate executable code; an instantiator for analysing architecture information based on the run-time statistics, determining architecture optimisations and generating configuration information, preferably domain-specific configuration information, including the architecture optimisations; and a builder for generating device-specific configuration information from the configuration information including the architecture optimisations.

Preferably, the adapter further comprises: a selector for profiling the configuration information and associated code for each candidate implementation, and selecting one or more optimal implementations based on predeterminable criteria.

Preferably, the adapter further comprises: a profiler for profiling information in a customisation specification and the run-time statistics, and identifying at least one processor style as a candidate for implementation; and a template generator for generating a template for each processor style identified as a candidate for implementation.

Preferably, the profiled information includes the application source code.

Preferably, the profiler is configured to identify a plurality of processor styles as candidates for implementation.

More preferably, ones of the processor styles are identified to execute parts of an application, whereby the application is to be executed by combined ones of the processor styles.

Preferably, the profiler is further configured to collect profiling information for enabling optimisation.

More preferably, the profiling information includes frequency of groups of opcodes.

More preferably, the profiling information includes information regarding operation sharing.

More preferably, the profiling information includes information regarding operation parallelisation.

Preferably, the analyser is configured to utilise the profiling information in analysing the instruction information, and determine the instruction optimisations therefrom.

Preferably, the instruction optimisations include operation optimisations.

More preferably, the operation optimisations include operation sharing optimisations.

More preferably, the operation optimisations include operation parallelisation optimisations.

Preferably, the instruction optimisations include custom instructions.

In one embodiment custom instructions are identified as candidates for optimisation based on frequency of use.

In another embodiment custom instructions are identified as candidates for optimisation based on a decision function D, where the decision function D is given by:

$D = {\max {\sum\limits_{j = 1}^{n}{\frac{T_{sw}C_{{sw},j}F_{j}}{T_{ci}C_{{ci},j}F_{j}}S_{j}}}}$

Where:

-   C_(sw,j) is the number of clock cycles to implement a software function ƒ( );
-   T_(sw) is the cycle time for each clock cycle in the clock cycle number C_(sw,j);
-   C_(ci,j) is the number of clock cycles to implement a custom instruction;
-   T_(ci) is the cycle time for each clock cycle in the clock cycle number C_(ci,j);
-   F_(j) is the number of times a procedure is called; and
-   S_(j) is a binary selection variable, denoting whether the custom instruction is implemented.
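By way of illustration only, the maximisation in the decision function D might be sketched as an exhaustive search over the binary selection variables S_(j). The specification does not state the constraint under which the maximum is taken; the area budget used here, the per-candidate area figures and all names are assumptions made for this example. As the function is written, F_(j) appears in both the numerator and the denominator of each term.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Hypothetical candidate record: (c_sw, c_ci, f, area), where area is an
# assumed hardware cost and f (the call count) must be greater than zero.
Candidate = Tuple[int, int, int, int]


def candidate_gain(t_sw: float, c_sw: int, t_ci: float, c_ci: int, f: int) -> float:
    """One term of D: (T_sw * C_sw,j * F_j) / (T_ci * C_ci,j * F_j)."""
    return (t_sw * c_sw * f) / (t_ci * c_ci * f)


def select_custom_instructions(
    candidates: Sequence[Candidate], t_sw: float, t_ci: float, area_budget: int
) -> Tuple[float, Tuple[int, ...]]:
    """Try every assignment of S_j in {0, 1} and keep the one with the largest
    sum of gains whose total area stays within the assumed budget."""
    best_value = 0.0
    best_selection: Tuple[int, ...] = tuple(0 for _ in candidates)
    for selection in product((0, 1), repeat=len(candidates)):
        area = sum(s * c[3] for s, c in zip(selection, candidates))
        if area > area_budget:
            continue
        value = sum(
            s * candidate_gain(t_sw, c[0], t_ci, c[1], c[2])
            for s, c in zip(selection, candidates)
        )
        if value > best_value:
            best_value, best_selection = value, selection
    return best_value, best_selection


# Made-up candidates: per-term gains are 5.0, 1.0 and 6.0 respectively.
candidates: List[Candidate] = [(100, 10, 50, 3), (40, 20, 10, 2), (60, 5, 5, 4)]
d_value, s_vector = select_custom_instructions(candidates, t_sw=10, t_ci=20, area_budget=7)
```

The exhaustive search is exponential in the number of candidates and is chosen here only for clarity; it is not suggested as the method used by the analyser.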

In one embodiment the analyser is configured to identify candidate instruction optimisations, and determine implementation of the instruction optimisations based on estimations performed by the instantiator.

Preferably, where the estimations from the instantiator provide that the re-programmable hardware cannot be programmed to implement all instructions together during run time, the analyser groups combined ones of instructions into sets of instructions which can be implemented by re-programming of the re-programmable hardware.
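By way of illustration only, the grouping of instructions into sets that each fit the re-programmable hardware might be sketched with a first-fit-decreasing heuristic. The heuristic, the area estimates, the capacity figure and the instruction names are assumptions made for this example; the specification does not prescribe a grouping algorithm.

```python
from typing import Dict, List


def group_instructions(areas: Dict[str, int], capacity: int) -> List[List[str]]:
    """Group instructions into sets whose total estimated area fits the device.

    `areas` maps a (hypothetical) instruction name to its estimated area;
    `capacity` is the assumed area available in one configuration.
    """
    groups: List[List] = []  # each entry is [remaining_capacity, [names]]
    # Place the largest instructions first; ties are broken by name.
    for name, area in sorted(areas.items(), key=lambda item: (-item[1], item[0])):
        if area > capacity:
            raise ValueError(f"{name} does not fit in a single configuration")
        for group in groups:
            if group[0] >= area:
                group[0] -= area
                group[1].append(name)
                break
        else:
            # No existing set has room, so a further configuration is needed.
            groups.append([capacity - area, [name]])
    return [names for _, names in groups]


# Made-up area estimates, in arbitrary units.
estimated_areas = {"mac": 600, "sbox": 500, "shift": 300, "xor": 100}
instruction_sets = group_instructions(estimated_areas, capacity=1000)
```

Each resulting set would correspond to one configuration of the re-programmable hardware, to be loaded by re-programming during run time.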

In one embodiment the analyser is configured to determine a plurality of implementations for different run-time conditions, each having instruction optimisations associated with the run-time conditions, and generate decision condition information associated with each implementation, which decision condition information enables selection between the implementations depending on actual run-time conditions.

Preferably, where the instruction optimisations cannot provide an implementation which complies with design constraints, the analyser is configured to invoke the profiler to re-profile the customisation specification based on analysis information provided by the analyser.

Preferably, the architecture optimisations performed by the instantiator include pipelining.

Preferably, the architecture optimisations performed by the instantiator include resource replication.

Preferably, the architecture optimisations performed by the instantiator include technology independent optimisations.

More preferably, the technology independent optimisations include removal of unused resources.

More preferably, the technology independent optimisations include opcode assignment.

More preferably, the technology independent optimisations include channel communication optimisations.

More preferably, the technology independent optimisations include customisation of data and instruction paths.

Preferably, where a plurality of configurations of the re-programmable hardware are required to implement the instruction processor, the instantiator is configured to optimise ones of the configurations into groups and schedule implementation of the grouped configurations.

Preferably, the adapter further comprises: a library containing processor definitions and associated parameters for a plurality of processor styles; and wherein the template generator is configured to generate templates from processor definitions and associated parameters extracted from the library.

Preferably, the processor styles include superscalar processors.

Preferably, the processor styles include hybrid processors.

In one embodiment the compiler is generated by the analyser, and the application source code is annotated with customisation information for compilation by the compiler to provide an optimised executable code.

In another embodiment the compiler is configured to compile the application source code and re-organise the compiled source code to incorporate optimisations to provide an optimised executable code.

In one embodiment the re-programmable hardware comprises at least one field programmable gate array.

In another embodiment the re-programmable hardware comprises at least one complex programmable logic device.

Preferably, the instruction processor is fully implemented using the re-programmable hardware.

In a yet further aspect the present invention provides a method of managing run-time re-configuration of an instruction processor implemented in re-programmable hardware, comprising the steps of: providing a configuration library containing configuration information for a plurality of instruction processor implementations; providing a code library for containing associated executable code for the implementations; loading application data and, as required, configuration information and executable code into re-programmable hardware for implementation and execution of an instruction processor; executing the executable code; obtaining run-time statistics relating to operation of the instruction processor; and loading new configuration information and associated executable code for a new implementation into the re-programmable hardware.

In one embodiment the loading step is performed automatically on a predeterminable event.

Preferably, the event is an instruction in the executable code.

In one embodiment the loading step is actuated by an external agent.

Preferably, the loading step is actuated in response to an actuation instruction from an external agent.

More preferably, the actuation instruction identifies the implementation to be implemented using the re-programmable hardware.

In one embodiment the method further comprises the step of: loading the configuration information and associated executable code for a new implementation into the respective ones of the configuration library and the code library prior to the loading step; and wherein the loading step comprises the step of: loading the configuration information and associated executable code for that implementation into the re-programmable hardware on actuation by an external agent.

In another embodiment the method further comprises the steps of: providing a decision condition library for containing associated decision condition information for at least ones of the implementations; profiling the decision condition information for a plurality of other implementations for various run-time conditions of the implementation loaded in the re-programmable hardware; determining whether the decision condition information for any of the other implementations more closely fits the run-time statistics; and wherein, where the decision condition information for one of the other implementations more closely fits the run-time statistics, the loading step comprises the step of: loading the configuration information and associated executable code for that implementation into the re-programmable hardware.

In a further embodiment the method further comprises the step of: generating one or more new implementations optimised to the run-time statistics.

In one embodiment the method further comprises the step of: loading the configuration information and associated executable code for each new implementation into respective ones of the configuration library and the code library.

In another embodiment the method further comprises the step of: loading the configuration information, associated executable code and associated decision condition information for each new implementation into respective ones of the configuration library, the code library and the decision condition library.

Preferably, the configuration information and associated executable code for a new implementation are loaded into the re-programmable hardware on satisfaction of predeterminable criteria.

More preferably, the configuration information and associated executable code for a new implementation are loaded into the re-programmable hardware where a re-configuration ratio R>1, the re-configuration ratio R being given by the function:

$R = \frac{T_{sw}{\sum\limits_{j = 1}^{n}{C_{{sw},j}F_{j}}}}{{T_{ci}{\sum\limits_{j = 1}^{n}\left( {C_{{ci},j}F_{j}} \right)}} + T_{r}}$

Where:

-   C_(sw,j) is the number of clock cycles to implement a software function ƒ( );
-   T_(sw) is the cycle time for each clock cycle in the clock cycle number C_(sw,j);
-   C_(ci,j) is the number of clock cycles to implement a custom instruction;
-   T_(ci) is the cycle time for each clock cycle in the clock cycle number C_(ci,j); and
-   T_(r) is the time required for re-configuration of the re-programmable hardware.
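By way of illustration only, for a single function (n = 1) the condition R>1 can be rearranged to give a break-even call count. This assumes, as in the definition of the decision function D, that F_(j) denotes the number of times the procedure is called, and further assumes that the custom instruction is faster per call, that is, T_(sw)C_(sw,1) > T_(ci)C_(ci,1):

$R > 1 \iff T_{sw}C_{sw,1}F_{1} > T_{ci}C_{ci,1}F_{1} + T_{r} \iff F_{1} > \frac{T_{r}}{T_{sw}C_{sw,1} - T_{ci}C_{ci,1}}$

On this reading, re-configuration pays off only once the procedure is called often enough to amortise the re-configuration time T_(r).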

In one embodiment the implementation generating step is performed on line.

In another embodiment the implementation generating step is performed off line.

Preferably, the implementation generating step comprises the steps of: analysing instruction information based on the run-time statistics and determining instruction optimisations; compiling the application source code to include the instruction optimisations and generate executable code; analysing architecture information based on the run-time statistics and determining architecture optimisations; generating configuration information including the architecture optimisations; and generating device-specific configuration information from the configuration information including the architecture optimisations.

Preferably, the implementation generating step further comprises the steps of: profiling the configuration information and associated code for each candidate implementation; and in response thereto, selecting one or more optimal implementations based on predeterminable criteria.

In one embodiment the implementation generating step further comprises the steps of: profiling information in a customisation specification and the run-time statistics; identifying at least one processor style as a candidate for implementation; and generating a template for each processor style identified as a candidate for implementation.

Preferably, the profiled information includes the application source code.

Preferably, a plurality of processor styles are identified as candidates for implementation in the customisation specification profiling step.

More preferably, ones of the processor styles are identified to execute parts of an application, whereby the application is to be executed by combined ones of the processor styles.

Preferably, profiling information for enabling optimisation is collected in the customisation specification profiling step.

More preferably, the profiling information includes frequency of groups of opcodes.

More preferably, the profiling information includes information regarding operation sharing.

More preferably, the profiling information includes information regarding operation parallelisation.

In one embodiment the instruction information analysis step comprises the steps of: utilising the profiling information in analysing the instruction information; and determining the instruction optimisations therefrom.

Preferably, the instruction optimisations include operation optimisations.

More preferably, the operation optimisations include operation sharing optimisations.

More preferably, the operation optimisations include operation parallelisation optimisations.

Preferably, the instruction optimisations include custom instructions.

More preferably, custom instructions are identified as candidates for optimisation based on frequency of use.

Still more preferably, custom instructions are identified as candidates for optimisation based on a decision function D, where the decision function D is given by:

$D = {\max {\sum\limits_{j = 1}^{n}{\frac{T_{sw}C_{{sw},j}F_{j}}{T_{ci}C_{{ci},j}F_{j}}S_{j}}}}$

Where:

-   C_(sw,j) is the number of clock cycles to implement a software function ƒ( );
-   T_(sw) is the cycle time for each clock cycle in the clock cycle number C_(sw,j);
-   C_(ci,j) is the number of clock cycles to implement a custom instruction;
-   T_(ci) is the cycle time for each clock cycle in the clock cycle number C_(ci,j);
-   F_(j) is the number of times a procedure is called; and
-   S_(j) is a binary selection variable, denoting whether the custom instruction is implemented.

In one embodiment the instruction information analysis step comprises the steps of: identifying candidate instruction optimisations; and determining implementation of the instruction optimisations based on estimations performed based on instantiation of the candidate instruction optimisations.

Preferably, where the estimations provide that the re-programmable hardware cannot be programmed to implement all instructions together during run time, the instruction information analysis step comprises the step of: grouping combined ones of instructions into sets of instructions which can be implemented by re-programming of the re-programmable hardware.

In another embodiment the instruction information analysis step comprises the steps of: determining a plurality of implementations for different run-time conditions, each having instruction optimisations associated with the run-time conditions; and generating decision condition information associated with each implementation, which decision condition information enables selection between the implementations depending on actual run-time conditions.

Preferably, where the instruction optimisations cannot provide an implementation which complies with design constraints, the instruction information analysis step comprises the step of: invoking the customisation specification profiling step to re-profile the customisation specification based on analysis information provided by the instruction information analysis step.

Preferably, the architecture optimisations include pipelining.

Preferably, the architecture optimisations include resource replication.

Preferably, the architecture optimisations include technology independent optimisations.

More preferably, the technology independent optimisations include removal of unused resources.

More preferably, the technology independent optimisations include opcode assignment.

More preferably, the technology independent optimisations include channel communication optimisations.

More preferably, the technology independent optimisations include customisation of data and instruction paths.

Preferably, where a plurality of configurations of the re-programmable hardware are required to implement the instruction processor, the instantiation step comprises the steps of: optimising ones of the configurations into groups; and scheduling implementation of the grouped configurations.

Preferably, each template is generated from processor definitions and associated parameters extracted from a library containing processor definitions and associated parameters for a plurality of processor styles.

Preferably, the processor styles include superscalar processors.

Preferably, the processor styles include hybrid processors.

In one embodiment the compiler utilised in compiling the application source code is generated in the instruction information analysis step, and the compiling step comprises the steps of: annotating the application source code with customisation information; and compiling the annotated source code to provide an optimised executable code.

In another embodiment the compiling step comprises the steps of: compiling the application source code; and re-organising the compiled source code to incorporate optimisations to provide an optimised executable code.

In one embodiment the re-programmable hardware comprises at least one field programmable gate array.

In another embodiment the re-programmable hardware comprises at least one complex programmable logic device.

Preferably, the instruction processor is fully implemented using the re-programmable hardware.

BRIEF DESCRIPTION OF THE DRAWINGS

Preferred embodiments of the present invention will now be described hereinbelow by way of example only with reference to the accompanying drawings, in which:

FIG. 1 diagrammatically illustrates a FIP design system in accordance with a preferred embodiment of the present invention;

FIG. 2 illustrates a skeletal processor template describing a basic instruction processor, the instantiation of the skeletal processor into a stack processor, and the Handel-C description of the stack processor;

FIG. 3 illustrates a skeletal processor template for a superscalar processor;

FIG. 4 illustrates two possible compilation paths for the compilation of executable FIP code for FIP implementations in accordance with embodiments of the present invention;

FIG. 5 illustrates the performance of various instructions of a JVM implemented on a Xilinx Virtex XCV 1000 device;

FIG. 6 illustrates the benchmark scores of JVMs, including those of FIP JVMs implemented as embodiments of the present invention;

FIG. 7 graphically illustrates the number of Virtex slices required as a function of the number of access procedures supported for a FIP implementation in accordance with an embodiment of the present invention and a direct hardware implementation on a Xilinx Virtex XCV 1000 device;

FIG. 8 diagrammatically illustrates a FIP management system in accordance with a preferred embodiment of the present invention;

FIG. 9 illustrates the FIP adapter of one embodiment of the FIP management system of FIG. 8;

FIG. 10 illustrates the FIP adapter of another embodiment of the FIP management system of FIG. 8;

FIG. 11 graphically illustrates the influence of various parameters on the re-configuration ratio R as employed by the FIP management system of FIG. 8;

FIG. 12 illustrates the sequential implementation of the procedure FFmulx of the AES algorithm in Java opcodes and chained opcodes as an optimisation;

FIG. 13 illustrates a further optimisation of the procedure FFmulx;

FIG. 14 graphically illustrates the re-configuration ratio R as a function of the number of 128-bit data blocks encrypted by FIPs in accordance with embodiments of the present invention implementing parallel custom instructions for different key widths;

FIG. 15 graphically illustrates the relative performance of FIP implementations for the AES algorithm;

FIG. 16 illustrates the relative speed-ups for the FIP implementations in FIG. 15;

FIG. 17 illustrates a debug tool in accordance with an embodiment of the present invention; and

FIG. 18 illustrates a compiler tool in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

I: Design-Time Customisation

FIG. 1 illustrates a FIP design system in accordance with a preferred embodiment of the present invention.

Customisation Specification

In designing a customised FIP implementation for deployment on re-programmable hardware, such as an FPGA, a customisation specification 1 is first provided.

The customisation specification 1 includes application information, which includes the application source code, which can be in any of several forms, such as C, Java and data-flow graphs, and customisation information, which includes design constraints, such as speed, area and latency, under possibly variable run-time conditions.

In this embodiment the application information also includes application data, which can include data representative of the data to be operated upon by the developed FIP, that is, data representative of the run-time conditions. In a preferred embodiment the application data includes data representative of the entire range of data to be operated upon by the developed FIP. As will become apparent hereinbelow, the provision of such application data enables analysis for a range of run-time conditions.

In this embodiment the customisation information also includes user-defined custom instructions, which define, typically by way of a structural definition, possible custom instructions, either as optional or mandatory instructions.

FIP Profiling

The customisation specification 1 is then subjected to a profiling procedure by a FIP profiler 2.

In the profiling procedure, the application information, in particular the application source code, is profiled to identify one or more candidate FIP styles which would be likely to provide an optimised implementation. For example, candidate selection may be made depending on the style of the application source code. Typically, a stack-based processor is an efficient processor for descriptions with a large number of small procedures in object-oriented style programming. In one preferred embodiment the profiling procedure compiles the application source code into opcodes for FIP styles. In an alternative embodiment the user may specify candidate FIP styles directly.

When identifying candidate FIP styles, the profiling procedure can identify different FIP styles as possibly being suited to the execution of different parts of the application. Thus, the FIP profiler 2 could propose single FIP styles for the execution of the entire application, or combined ones of a plurality of different FIP styles for the execution of the application, which FIP styles would be achieved by re-configuration during run time, where the re-programmable hardware does not allow for the simultaneous configuration of all of the FIP styles.

In the profiling procedure, profiling information is also collected about the structure of the application source code, in particular the frequency of use of opcodes and groups of opcodes for the FIP styles, and the possible degree of sharing and parallelisation.
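By way of illustration only, the collection of frequency information for opcodes and groups of opcodes might be sketched as follows. The treatment of a "group" as a run of consecutive opcodes of fixed length, the function name and the example opcode sequence are assumptions made for this example.

```python
from collections import Counter
from typing import Sequence, Tuple


def profile_opcodes(
    opcodes: Sequence[str], group_size: int = 2
) -> Tuple[Counter, Counter]:
    """Count individual opcodes and consecutive groups of `group_size` opcodes."""
    singles = Counter(opcodes)
    groups = Counter(
        tuple(opcodes[i : i + group_size])
        for i in range(len(opcodes) - group_size + 1)
    )
    return singles, groups


# Made-up sequence of Java bytecode mnemonics standing in for compiled source.
sequence = ["iload", "iload", "iadd", "istore", "iload", "iload", "iadd", "istore"]
opcode_counts, group_counts = profile_opcodes(sequence)
```

Groups that recur frequently in such a profile would be natural candidates for consideration as custom instructions.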

Based on the identified candidate FIP styles, the processor definitions and associated parameters for those FIP styles are extracted from a FIP library 3, together with the resources associated with any user-defined custom instructions, such as registers and look-up tables.

FIP Template Generation

FIP templates are then generated automatically by a FIP template generator 4 for each of the identified candidate FIP styles from the processor definitions and associated parameters extracted from the FIP library 3. The FIP templates incorporate the customisation information from the customisation specification 1, such as the user-defined custom instructions.

In this embodiment Handel-C (Version 2.1, [18]), a hardware description language with a C-like syntax, is used to implement the instruction processors. It should, however, be understood that the FIP design system of the present invention is not restricted to Handel-C descriptions, and other hardware description languages could equally be used.

Handel-C enables the entire design process to be maintained at a high level of abstraction, which is advantageous both in the design of the FIP implementation and the inclusion of custom instructions. Handel-C also provides for the rapid prototyping of designs. The present invention is directed particularly at the provision of FIP implementations which are customised for specific applications, particularly lightweight implementations for embedded systems. Using a high-level language, such as Handel-C, simplifies the design process by having a single abstract description and provides a mechanism for demonstrating the correctness of the designed FIP implementations [11, 19].

FIG. 2 illustrates a skeletal processor template which describes a basic instruction processor, the instantiation of the skeletal processor into a stack processor, and the Handel-C description of the stack processor.

In the processor template, the Fetch module fetches an instruction from external memory and sends it to the Execute module. The Fetch module then awaits a signal from the Execute module that the Execute module has completed updating shared resources, such as the program counter. Possible parameterisations include the addition of custom instructions, the removal of unnecessary resources, the customisation of data and instruction paths, the optimisation of opcode assignments, and varying the degree of pipelining.

By way of example, one skeletal template for the processor can be described as follows:

    // Hardware resources
    #include program_counter
    chan fetchchan;
    int ir_fetch, ir_exe;

    par {
        // -- Fetch module --
        {
            // Fetch the instruction pointed to by the program counter
            ir_fetch = Fetch_from_memory(program_counter);
            // Send previous instruction to the Execute module
            fetchchan ! ir_fetch;
            // Increment the program counter
            program_counter++;
        }
        // -- Execute module --
        {
            // Receives instruction from the Fetch module
            fetchchan ? ir_exe;
            // Decodes and executes the relevant instruction
            switch (decode(ir_exe))
            {
                // Instruction implementations
            }
        }
    }

In Handel-C, a channel communication blocks until both the sender and the receiver are ready. The operators ! and ? are used to send to and receive from channels. For example, fetchchan ! ir_fetch will send the information contained in ir_fetch through the fetchchan channel, and fetchchan ? ir_exe will receive that information into ir_exe.

The above-described template describes a basic instruction processor. Modern instruction processors can incorporate many features to enhance efficiency. These features include superscalar architecture, pipelining, interrupts and memory management.

FIG. 3 illustrates a skeletal processor template for a superscalar processor.

This processor template comprises a plurality of Execute modules which are interconnected via communication channels in order to maintain the integrity of shared resources.

As will be understood, a superscalar processor provides for the concurrent utilisation of multiple resources. In order to support superscalar architectures, the processor template framework has to provide for the necessary scheduling. Scheduling of instructions can occur either at compile time or dynamically at run time. Where scheduling occurs at compile time, the associated compiler for the processor would be responsible for scheduling. Otherwise, where scheduling occurs dynamically at run time, the Fetch module would incorporate a scheduling algorithm.

This processor template also provides a platform for creating hybrid processors. As will be understood, hybrid processors provide for the ability to execute more than one style of instructions. Current complex processors can often be considered as hybrid processors. Intel® x86 processors, for example, employ a register-based approach for most instructions, while floating-point instructions operate on a stack. In the present invention, hybridisation provides a means of combining the advantages of various processor styles into a FIP implementation.

It is well known that the instructions for different processor styles have different characteristics. For instance, register-based processors tend to have longer instructions and require more program instructions, as compared to the instructions for stack-based processors. Also, register-based instructions allow parallelism to be exploited more easily, whilst stack-based instructions tend to have more dependencies and often run sequentially.

The possibility of combining multiple instruction formats into a single hybrid FIP implementation allows for a trade-off between speed and code size, which may be important for embedded devices with limited storage.

The binary description given to a hybrid FIP implementation may contain instructions packed in the styles of different processors. The Fetch module of such a hybrid FIP implementation would incorporate an additional level of decoding to determine the appropriate style, and channel the instruction to the corresponding Execute module. For example, a hybrid processor may contain both a MIPS Execute module and a TIM Execute module, composed in the same manner as superscalar processors. This hybrid FIP would run MIPS code, but be augmented by the ability to support functional languages.
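By way of illustration only, the additional level of decoding in the Fetch module of a hybrid FIP can be modelled in software as follows. This is a sketch under stated assumptions: the use of the top bit of a 32-bit instruction word as a style tag, and the names `make_hybrid_fetch`, `mips_execute` and `tim_execute`, are hypothetical and are not taken from the described embodiment.

```python
def make_hybrid_fetch(execute_modules):
    """Return a dispatcher which routes each instruction word by its style tag."""
    def dispatch(word):
        # Assumed encoding: the top bit of a 32-bit word selects the style,
        # and the remaining 31 bits carry the style-specific instruction.
        style = (word >> 31) & 0x1
        payload = word & 0x7FFFFFFF
        return execute_modules[style](payload)
    return dispatch

# Stand-ins for the two Execute modules; each reports which module ran.
def mips_execute(payload):
    return ("MIPS", payload)

def tim_execute(payload):
    return ("TIM", payload)

fetch = make_hybrid_fetch({0: mips_execute, 1: tim_execute})
mips_result = fetch(0x00000005)   # routed to the MIPS Execute module
tim_result = fetch(0x80000007)    # routed to the TIM Execute module
```

In hardware, the dictionary look-up would correspond to selecting the channel over which the instruction is sent to the appropriate Execute module.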

It is also possible to generate multiple processor systems. In such systems, different instruction streams would feed into each individual FIP, which FIPs may communicate with one another via channels.

As mentioned hereinabove, other advanced processor modules, such as modules for pipelining, interrupt handling and memory management, can also be incorporated into the FIP implementation in a similar way, with the modules interfacing with a standard template using channels. Pipeline communications can be simplified where it is known that a hazard will not arise. Profiling the application domain can provide this information. Speculative execution can also be supported by simultaneously executing both paths of a branch until a guard condition is determined.

At design time, initially no instructions exist in an Execute module. As instructions are added into the Execute module, a counter in the design system is incremented to keep track of the number of instructions that have been added. An opcode file is also generated which provides the mapping between each opcode and its binary representation. In the simplest case, the binary representation of an instruction is the counter value. However, other numbering schemes could be employed, such as one-hot encoding.

By way of example, the opcode file can take the following form:

    #define POP 1
    #define PUSH 2
    #define MUL 3
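By way of illustration only, the generation of such an opcode file from the instruction counter can be sketched as follows. The function names `assign_opcodes` and `opcode_file` and the `scheme` parameter are assumptions introduced for the example; the counter and one-hot schemes are those mentioned above.

```python
def assign_opcodes(names, scheme="counter"):
    """Map instruction names to binary codes in the order they are added."""
    table = {}
    for count, name in enumerate(names, start=1):
        if scheme == "counter":
            table[name] = count              # code is the counter value
        elif scheme == "one-hot":
            table[name] = 1 << (count - 1)   # exactly one bit set per opcode
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return table

def opcode_file(table):
    """Render the mapping as #define lines."""
    return "\n".join(f"#define {name} {code}" for name, code in table.items())

counter_codes = assign_opcodes(["POP", "PUSH", "MUL"])
one_hot_codes = assign_opcodes(["POP", "PUSH", "MUL"], scheme="one-hot")
text = opcode_file(counter_codes)
```

With the counter scheme, the rendered text corresponds to the three #define lines given above; with the one-hot scheme, MUL would instead be assigned the value 4.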

Instructions are easily incorporated into an Execute module using #include declarations. By way of example, a stack-based multiply instruction can be included within the switch block of an Execute module as:

#include stack_mul.hc

Where, by way of example, the contents of stack_mul.hc could be written as:

    // Read top of stack
    iData1 = readTOS();
    par {
        // Multiply next item in stack to iData1
        iData1 *= readStack2();
        // Adjust the stack pointer
        sp--;
    }
    par {
        // Write data back into stack
        writeStack(sp-1, iData1);
        // Flag to denote that an instruction has completed execution
        inst_done = 1;
    }

Custom instructions can also be included in this way by including the instructions in an Execute module, and providing the hardware implementation in a separate file.
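By way of illustration only, the behaviour of the stack-based multiply instruction can be checked with a small software model. This is a sketch which assumes that the stack pointer sp indexes the first free slot, so that the top of stack is at sp-1 and the next item at sp-2; the class name `StackModel` and its method names are hypothetical, and the model executes sequentially rather than in parallel.

```python
class StackModel:
    """Software model of the stack resources used by the multiply instruction."""

    def __init__(self, size=16):
        self.mem = [0] * size
        self.sp = 0          # assumed: sp indexes the first free slot
        self.inst_done = 0

    def push(self, value):
        self.mem[self.sp] = value
        self.sp += 1

    def read_tos(self):
        return self.mem[self.sp - 1]

    def read_stack2(self):
        return self.mem[self.sp - 2]

    def stack_mul(self):
        data = self.read_tos()        # read top of stack
        data *= self.read_stack2()    # multiply by the next item (uses the old sp)
        self.sp -= 1                  # adjust the stack pointer
        self.mem[self.sp - 1] = data  # write the product back as the new top
        self.inst_done = 1            # flag completion of the instruction

stack = StackModel()
stack.push(6)
stack.push(7)
stack.stack_mul()
```

After the instruction, the two operands have been replaced by a single entry holding their product, and the completion flag is set.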

The FIP templates are then each processed as follows, with the instruction information for each FIP implementation being subjected to a FIP analysis procedure by a FIP analyser 5 and the architecture information for each FIP implementation being subjected to a FIP instantiation procedure by a FIP instantiator 6.

FIP Analysis

In this embodiment the FIP analyser 5 includes a pre-compiler for the pre-compilation of the application source code, this pre-compilation simplifying the subsequent analysis of the instructions. The output of the pre-compilation can be any intermediate representation, such as data-flow graphs or opcodes. In an alternative embodiment the pre-compiler can be omitted.

For each candidate FIP template, the FIP analyser 5 analyses the profiling information, as obtained by the FIP profiler 2, to identify candidate instruction optimisations, that is, candidate optimisations of operations, typically the sharing and parallelisation possibilities, and operations as candidates for custom instructions, such as by the identification of groups of frequently-used opcodes.

The resulting candidate instruction optimisations are then passed to the FIP instantiator 6, which produces estimations, such as of the speed, size and latency, for those optimisations and returns the estimations to the FIP analyser 5.

The FIP analyser 5 evaluates the estimations received from the FIP instantiator 6 to determine which of the candidate instruction optimisations, that is, operation optimisations and custom instructions, should be incorporated. In one embodiment the FIP analyser 5 evaluates the estimations to determine whether design constraints, for example, as given in the customisation specification 1, are met.
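By way of illustration only, one possible way of evaluating such estimations against an area constraint is a greedy selection, sketched below. This is not stated to be the evaluation used by the FIP analyser 5: the greedy ranking by cycles saved per unit area, the function name `select_optimisations`, the field names and all of the numbers are assumptions introduced for the example.

```python
def select_optimisations(candidates, area_budget):
    """Greedily choose candidates by estimated benefit per unit area.

    Each candidate is a dict with hypothetical fields 'name',
    'cycles_saved' and 'area' (estimates of the kind returned by
    an instantiation step).
    """
    ranked = sorted(
        candidates,
        key=lambda c: c["cycles_saved"] / c["area"],
        reverse=True,
    )
    chosen, used = [], 0
    for candidate in ranked:
        if used + candidate["area"] <= area_budget:
            chosen.append(candidate["name"])
            used += candidate["area"]
    return chosen, used

# Hypothetical estimates; none of these values come from the description.
candidates = [
    {"name": "insert_link", "cycles_saved": 1000, "area": 100},
    {"name": "search_value", "cycles_saved": 600, "area": 120},
    {"name": "mul_add", "cycles_saved": 300, "area": 20},
]
chosen, used = select_optimisations(candidates, area_budget=150)
```

A greedy choice of this kind is simple but not guaranteed to be optimal; an exhaustive or constraint-solving evaluation could equally be substituted.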

In one embodiment the evaluation by the FIP analyser 5 determines whether the variability in run-time conditions, as represented by the application data in the customisation specification 1, is such that a plurality of FIP implementations are required, each having different instruction optimisations, in order to provide for optimal performance of the application. That is, a plurality of FIP implementations are determined, each being associated with particular run-time conditions. In this embodiment each of the FIP implementations has associated decision condition information 7, which, as will be described hereinbelow, enables a FIP management system subsequently to select between the developed FIP implementations depending on the actual run-time conditions. Where the run-time conditions are extremely stable, for example, in operating on data sets of very similar format, the FIP analyser 5 would probably determine only a single FIP implementation as being necessary.

Where constraints, such as speed or area, make it impossible to implement all instructions in a single FIP configuration, the FIP analyser 5 can group ones of the instructions into different groups, so that only the relevant instruction groups are implemented at a particular time during run time. There are several ways in which this grouping can be implemented. In one implementation, each FIP implementation would have the same number of instructions, the only difference being that some of the instructions would be null instructions which, instead of containing a large hardware description, contain only a small hardware description requiring that another FIP implementation be loaded. Another implementation is to let the run-time host schedule the re-configuration. In this case, each FIP implementation would still have the same number of instructions, but some of the instructions may be implemented with less area and more latency, for example, in digit serial form. The run-time host can then decide whether it is more efficient to re-configure to a faster FIP implementation, even if the re-configuration time is longer than using a slow operation.

Where the resulting FIP implementations do not comply with predetermined constraints, such as specified in the customisation specification 1, the FIP analyser 5 can be configured, as in this embodiment, to invoke the FIP profiler 2 further to re-profile the customisation specification 1, based additionally on analysis information as provided by the FIP analyser 5, such as to provide modified instruction information, typically by way of providing instruction information for another, possibly-related, FIP style.

FIP Instantiation

For each FIP implementation as determined by the FIP analyser 5, the FIP instantiator 6 develops the one or more FIP hardware configurations which are required to perform the application instructions. The FIP instantiator 6 is also configured to optimise the processor architecture.

As discussed hereinabove, a plurality of FIP hardware configurations are developed where the re-programmable hardware cannot accommodate all of the required FIP hardware configurations at a particular time, with the different FIP hardware configurations being re-programmed on the hardware at different times. Where a plurality of FIP hardware configurations are required, the FIP instantiator 6 is configured to optimise both the construction and the scheduling of the FIP hardware configurations, for example, in order to minimise re-configuration.

In this embodiment optimisations include congestion mitigation. Where a resource is used too often, routing congestion in the re-programmable hardware 22 will cause the FIP to slow down and take up more area. The FIP instantiator 6 is configured to detect this condition and invoke techniques to mitigate the problem, such as pipelining the routing or decoding mechanism, and replicating appropriate resources. Where resources are replicated, the FIP analyser 5 is instructed to create new instructions to access the additional resources.

In this embodiment optimisations also include technology independent optimisations, such as removing unused resources, opcode assignment, channel communication optimisations, and customising data and instruction paths according to the specific domain. The size and caching strategy for memory caches or the garbage collection strategies can also be customised.

The FIP instantiator 6 then generates FIP configuration information 8 for each FIP implementation. The FIP configuration information 8 may be specific to a given domain of application, such as image processing.

FIP Building

The domain-specific FIP configuration information 8 for each FIP implementation is then subjected to technology specific optimisations, such as resource binding, constraint satisfaction and instantiation of vendor specific macros, by a FIP builder 9 to provide device-specific FIP configuration information 10, that is, configuration information for the specific FIP implementation and use with specific re-programmable hardware. In particular, any special resources which are available, such as fast-carry chains, embedded memory, etc, are deployed.

The FIP instantiator 6 is iteratively and/or interactively employed if constraints cannot be satisfied.

FIP Selection

The source code, as developed by the FIP analyser 5, and the FIP configuration information 10, as developed by the FIP instantiator 6, for the one or more FIP implementations for each FIP template are then profiled by a FIP selector 11 to select the optimal FIP style, having one or more FIP implementations as described hereinabove, based on predetermined criteria, which criteria usually involve a trade-off between speed and size.

FIP Compilation

Following the selection of one or more FIP implementations by the FIP selector 11, the source code for those implementations is then compiled by a FIP compiler 12 to provide an executable FIP code 13.

FIG. 4 illustrates two possible compilation paths for the compilation of the executable FIP code 13.

In this embodiment, the left-hand compilation path of FIG. 4, the source code is annotated with information relating to instruction optimisations, such as the frequency of use of instructions and groups of instructions, and shared resources. This step transforms standard source code into source code which includes specific information which is utilised in the optimisation of both the compiled code and the FIP implementation. The advantage of this compilation technique is that no information is lost during the design flow, enabling the optimisation process to be as effective as possible.

In this embodiment the FIP compiler 12 and related tools are generated automatically by the FIP analyser 5, for example, based on the requirements of the selected FIP templates and associated parameters, and thus are FIP specific. In this way, the source code, in this embodiment annotated with instruction optimisations, can be compiled by the FIP-specific compiler 12 into an executable FIP code 13 and control mechanisms, in a preferred embodiment software control mechanisms, for determining and commissioning at run time the instructions for this FIP implementation optimised for specific durations at run time. The related tools include assemblers, linkers, disassemblers, debuggers, instruction set simulators and other facilities for generating appropriate executable FIP code 13 and optimising performance, reducing size, reducing power consumption, etc.

In an alternative embodiment, the right-hand compilation path of FIG. 4, an available compiler is utilised to compile the source code. This compiler can be a standard compiler or a compiler created from a previous FIP implementation. In this compilation technique, the compiled code is evaluated to determine possible optimisations, and re-organised to exploit instruction-level parallelism and other optimisations. This is similar to the idea of just-in-time (JIT) compilation for JVMs. The advantage of this technique is that existing compilers can be used and pre-compiled code can execute on the FIP. However, since it is often difficult to identify possible optimisations in compiled code, this approach may yield a less optimal solution than using a FIP-specific compiler.

FIP Implementation

The resulting FIP configuration information 10, which can define more than one FIP configuration so as to enable run-time re-configuration, and executable FIP code 13 provide an optimised FIP implementation for the required application. Where a plurality of FIP implementations are developed to provide for optimal performance for varying run-time conditions, each FIP implementation has associated decision condition information 7, as described hereinabove, to enable selection between the FIP implementations depending upon the run-time conditions. Incremental configurations to convert one customised hardware implementation to another can also be created, if appropriate [16].

The developed FIP implementation can then be deployed directly into a FIP management system 14, or alternatively, or in addition, into FIP libraries for subsequent loading into re-programmable hardware. Where a plurality of FIP implementations are developed, one of the FIP implementations is selected for initial deployment, for example, based on providing average performance. Indeed, it is envisaged that the FIP configuration information 10 and the executable FIP code 13, and, where relevant, the decision condition information 7, for the developed FIP implementation could be transferred to a centralised library for subsequent downloading or directly to systems which execute similar applications, thereby providing a means for upgrading those other systems.

Also, in this embodiment, the FIP style and associated parameters for the FIP configuration information 10 for each developed FIP implementation are loaded into the FIP library 3, thereby expanding the number of FIP styles contained in the FIP library 3.

FIP Optimisations

As described hereinabove, optimisations for FIP implementations can occur at two levels, that is, both the software and the processor can be optimised. Advances in optimising compilers and instruction processor designs can be adapted for use in FIP architectures and compilers. Modification of these techniques for use with FIP systems will be described hereinbelow.

Optimisations can be broadly categorised into four groups:

Technology Independent

-   Removal of unused resources and instructions
-   Customisation of datapaths and instructions
-   Optimisation of opcode assignments
-   Optimisation of channel communications between modules

Technology Dependent (Typically for FPGA Implementation)

-   Deployment of available special resources such as fast-carry chains, embedded memory, etc
-   Introduction of congestion management to reduce delays due to routing

Processor Style Specific

-   Processor type, such as JVM, MIPS, TIM
-   Superscalar architecture, pipelining, etc

Compiler Specific

-   Instruction level parallel scheduling
-   Opcode re-ordering
-   Loop unrolling and folding
-   Predicated execution

Some of these optimisations have already been developed [11]. The following describes custom instructions and technology dependent optimisations.

Direct hardware implementations of specific data paths can be incorporated into FIP hardware configurations for activation by custom instructions. This improves performance as it reduces the number of instructions which have to be fetched and decoded. However, the more custom instructions, the larger the FIP hardware configuration. Hence, the improvement in speed is accompanied by an increase in size. The choice of the type and the number of custom instructions is important. This selection should also depend on how frequently a particular custom instruction is used. The trade-offs will be described in more detail hereinbelow.

Furthermore, it is possible to optimise away the fetch and decode stage altogether, leaving only the data path, thereby effectively giving a direct hardware implementation, akin to the hardware implementation described in relation to FIG. 7. With this configuration, problems associated with memory bottlenecks can be obviated.

Optimisations specific to certain processor styles are also possible. These are often related to device dependent resources. For example, in a JVM, if multiple banks of memory exist, stack access could be enhanced so that the top two elements of a stack can be read concurrently. Device dependent resources can be exploited by using technology-specific hardware libraries [20] and vendor provided macros, such as Relationally Placed Macros [21] as provided by Xilinx Inc. and Megafunctions [22] as provided by Altera Corporation.

In FPGAs, unlike ASICs, registers are abundant, but routing can incur a large delay penalty, as well as increase the size of a design. This feature of FPGAs places restrictions on template designs. Routing congestion occurs when a resource is used extensively by many operations. Criteria such as the size of the resource or the routing density of neighbouring modules may also affect the routing of a FIP hardware configuration. Three design solutions are presented herein. The first and simplest design solution is to pipeline the routing. The second design solution is to arrange the decoding network, which controls the activation of a resource, as a pipelined tree. This results in a shorter cycle time and a smaller logic-to-routing delay ratio, but at the expense of larger area and more complex circuitry. The third design solution is to replicate the resources. Resources should only be shared where it is beneficial to do so. For example, instructions frequently require temporary registers for intermediate results, so sharing of those resources is inefficient. For shared operations, area and speed can be traded off against latency. For instance, if the shared resource is a single-cycle multiplier, it can be replaced by several digit-serial multipliers, where parallel to serial converters are placed at locations to reduce the routing congestion. However, if the replicated resource is a shared storage, care needs to be taken to ensure the consistency of the state information.

FIP JVM and MIPS Implementations

Operation of the FIP design system will now be described hereinbelow by way of example with reference to FIP JVM implementations [23], and the performance of those implementations compared against software and ASIC implementations. The performance of a FIP implementation of a MIPS style processor will also be discussed.

The FIP JVM implementations have been developed based on the JVM specification. Many parameterisations and optimisations have been investigated, including the removal of unnecessary resources, the customisation of data and instruction paths, the optimisation of opcode assignments, and the variation of the degree of pipelining. The above-described customised JVMs have been implemented using the RC1000-PP device (Celoxica Limited, UK).

In a first embodiment a FIP JVM has been developed which utilises shared segregated resources. This embodiment provides good area utilisation, but at the expense of speed, because of routing congestion.

In a second embodiment a FIP JVM has been developed which utilises two stages of pipelining and only shares irreplaceable resources, such as the stack and main memory. Stack-based processors are intrinsically sequential. Speed optimisation of the FIP JVM introduces parallelism which is manifested as register-style instruction implementations.

In a third embodiment a FIP JVM has been developed which incorporates deeper pipelines for certain instructions and ‘register’ style improvements, such as having top-of-stack registers. The top-of-stack registers are replicated. Instructions can be read from different top-of-stack registers, but are written back to the stack directly. Replicated registers are updated during the fetch cycle. Most instructions are processed by four pipeline stages, although certain instructions, such as the instruction for invoking functions, require deeper logic and the implementation of those instructions has been partitioned into five or six pipeline stages. Routing has also been pipelined to reduce the effects of congestion.

These FIP JVM embodiments demonstrate trade-offs between possible parameterisations.

Maximising sharing methods for re-programmable hardware through conventional resource sharing may introduce significant routing overheads. Congestion management is necessary to identify the optimal degree of sharing when the amount of routing begins to dominate the implementation medium.

Pipelining is useful for reducing clock cycle time. However, resources such as stacks may have operation dependencies which limit the amount of overlapping between instructions, and also introduce latency when pipelined.

For the following evaluations, the above-described third embodiment of the FIP JVM is utilised, with the current program counter and data path size being 32 bits. As illustrated in FIG. 5, the theoretical upper bound for this implementation is predicted to be roughly 80 MHz, when only the NOP instruction is supported. This demonstrates that the fetch-decoding structure is reasonably efficient. The clock speed could be further increased by reducing the program counter size or improving the adder.

The performance of the FIP JVM is compared with a JVM running on an Intel® processor (Pentium® II at 300 MHz) and an ASIC Java processor (GMJ30501SB at 200 MHz from Helborn Electronics, Segyung, Korea). The GMJ30501SB is based on the picoJava-I core [24] from Sun Microsystems. The CaffeineMark 3.0 [25] Java benchmark has been used to measure performance. The CaffeineMark 3.0 benchmark is a set of tests used to benchmark the performance of JVMs in embedded devices. These include tests on the speed of Boolean operations, execution of function calls and the generation of primes.

FIG. 6 illustrates the benchmark scores of the FIP JVM, together with those achieved with JVM software running on the Intel® processor and the ASIC Java processor. The FIP JVM implementation compares favourably with the software implementation, and a version with a deeper pipeline is estimated to run seven times faster. While the ASIC Java processor achieves the fastest speed, there are two significant factors to be borne in mind. Firstly, the ASIC Java processor is running at 200 MHz, compared to the FIP JVM at 33 MHz. Secondly, the ASIC Java processor has fixed instructions, while the FIP JVM enables the incorporation of custom instructions by re-configuration. The speed-up provided by the FIP JVM is expected to increase towards that shown by the ASIC Java processor as more custom instructions are added. In the following, the trade-offs concerning providing custom instructions are demonstrated.

Linked lists may be used to organise e-mails or phone numbers in embedded systems, such as cell phones. Direct hardware implementations, that is, implementations without fetching and decoding instructions, have been developed to manipulate a linked list structure having separate circuits supporting different access procedures, such as inserting a link and searching for a value. These direct implementations are clocked at 40 to 70 MHz, and can be incorporated as data paths for custom instructions in the FIP JVM.

An insertion sort algorithm has been written using both the direct hardware approach and the FIP JVM approach for comparison. The direct hardware implementation takes 2.3 ms to sort a list of 100 links, while the FIP JVM takes 6.4 ms and the ASIC JVM is estimated to take 1 ms. The insertion of a link into the list takes 22 Java instructions.

By including a custom instruction to insert a link, the execution time for the FIP JVM can be reduced to 5 ms, since the single custom instruction takes 12 cycles to complete. This gives a saving of 10 cycles, with 10 fetch and decode cycles also saved, per instruction. It will be noted that a custom instruction requiring fewer cycles to execute can be utilised, but the cycle time could be longer. If two custom instructions were added, the execution time would be reduced to 3.1 ms. However, the addition of custom instructions not only speeds up the application, but also increases the size of the FIP hardware configuration. The trade-off of using another FIP implementation will be considered hereinbelow.

For purposes of comparison, a MIPS-style FIP which can be clocked at 30 MHz was developed. Two kinds of comparisons were undertaken. Device-independent comparisons look at the number of gates, registers and latches used. Device-dependent comparisons look at the number of Xilinx Virtex slices used.

FIG. 7 illustrates the trade-offs between a fully-shared FIP implementation and a direct hardware implementation.

In general, the direct hardware implementation executes in fewer cycles and can be clocked at a higher frequency than FIP implementations. For instance, an insert instruction takes 12 cycles at 39 MHz in the direct hardware configuration, as compared to 22 cycles at 30 MHz in a FIP. The direct hardware implementation takes 2.3 ms to sort a list of 100 links, whereas the FIP takes 7.1 ms. However, the FIP uses only 290 Virtex slices, as compared to 460 slices used by the direct hardware implementation.
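By way of illustration only, the per-operation figures quoted above can be converted into times and ratios as follows. The cycle counts, clock frequencies, sort times and slice counts are those given in the text; the function name `op_time_us` is an assumption, and the calculation covers only the single insert operation and not the complete sort.

```python
def op_time_us(cycles, clock_mhz):
    """Time in microseconds for an operation of `cycles` at `clock_mhz`."""
    return cycles / clock_mhz

direct_insert = op_time_us(12, 39)   # direct hardware: 12 cycles at 39 MHz
fip_insert = op_time_us(22, 30)      # FIP: 22 cycles at 30 MHz

insert_ratio = fip_insert / direct_insert   # how much slower the FIP insert is
sort_ratio = 7.1 / 2.3                      # ratio of the quoted sort times
area_ratio = 460 / 290                      # direct hardware slices per FIP slice
```

On these figures, a single insert takes roughly 0.31 µs in the direct hardware configuration and roughly 0.73 µs in the FIP, a factor of about 2.4, while the direct hardware implementation occupies about 1.6 times as many slices.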

As will also be noted from FIG. 7, the FIP implementation is smaller than the direct hardware implementation for applications involving five or more access procedures. The cross-over point provides a means of estimating when it is no longer beneficial to include more custom instructions. As more custom instructions are added to the FIP implementation, the cross-over point will shift upwards.

The FIP implementation of the present invention is thus efficient and provides a good mechanism for resource sharing. The execution speed of the FIP could be improved by incorporating custom instructions, but this could be at the expense of size. Furthermore, device-independent results can be utilised to estimate the number and type of custom instructions in a FIP implementation. This provides a means of automating the optimisation of resource sharing. As sharing increases, the amount of routing congestion will also increase, since a larger number of instructions in a FIP implementation may result in more congestion. Custom instructions reduce the number of instructions, hence increasing throughput and reducing congestion.

In summary, the FIP design system of the present invention provides a framework for the systematic customisation of programmable instruction processors. The FIP approach enables rapid development of instruction processors by parameterising, composing and optimising processor templates. Furthermore, either a standard compiler or a FIP-specific compiler can be used in the implementation process.

II—Run-Time Adaptation

FIG. 8 illustrates a FIP management system in accordance with a preferred embodiment of the present invention.

As described hereinabove, the FIP design system of the present invention provides the FIP configuration information 10 and associated executable FIP code 13, and, where relevant, the associated decision condition information 7, for FIP implementations, and also the corresponding FIP-specific compilers 12. The design environment also generates the initial run-time environment. Users can determine the capability of the run-time environment at compile time. For example, the user can decide whether full re-configuration or automatic refinement is required during run-time. This will determine the complexity of the run-time environment.

The FIP management system includes a run-time manager 18, which is the central hub of the system.

FIP Execution

The run-time manager 18 includes a FIP loader controller 19 for controlling a FIP loader 20.

When instructed to execute an application, the FIP loader controller 19 signals the FIP loader 20 to load application data 21 which is to be operated upon by the FIP, and, as required, FIP configuration information 10, executable FIP code 13, and, where relevant, decision condition information 7, for a FIP implementation, into re-programmable hardware 22, in this embodiment an FPGA, and then execute the FIP code 13 to provide application results 24. Where the required FIP implementation is already loaded in the re-programmable hardware 22, the FIP loader 20 loads only the application data 21.
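The load-on-demand behaviour described above can be sketched in Java as follows. This is only an illustrative model: the class name `FipLoaderSketch`, the string identifiers and the returned result format are hypothetical and are not taken from the described system, and the actual loading of configuration information and code is reduced to a counter.

```java
import java.util.Objects;

/** Hypothetical model of the load-on-demand behaviour of the FIP loader 20. */
final class FipLoaderSketch {
    private String loadedImplementation; // id of the FIP currently configured, or null
    private int reconfigurations;        // number of times the hardware was re-programmed

    /** Re-programs only when a different FIP is required, then "executes" on the data. */
    String execute(String requiredImplementation, String applicationData) {
        if (!Objects.equals(requiredImplementation, loadedImplementation)) {
            // Stand-in for loading configuration information 10, FIP code 13 and,
            // where relevant, decision condition information 7 from the libraries.
            loadedImplementation = requiredImplementation;
            reconfigurations++;
        }
        // Stand-in for executing the FIP code on the application data 21.
        return loadedImplementation + ":" + applicationData;
    }

    int reconfigurations() {
        return reconfigurations;
    }
}
```

Calling `execute` twice with the same implementation identifier leaves the re-configuration count at one, mirroring the case where only the application data is loaded.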

The FIP configuration information 10, associated executable FIP code 13, and associated decision condition information 7 are obtained from respective ones of a FIP configuration information library 25, an executable FIP code library 26, and a decision condition information library 27. The FIP configuration information library 25 contains a plurality of FIP configuration information files, each for configuring the re-programmable hardware 22 to perform a custom application on the execution of the associated FIP code 13. The executable FIP code library 26 contains a plurality of executable FIP code files, each being associated with a respective one of the FIP configuration information files. The decision condition information library 27 contains a plurality of decision condition information files, each being associated with a respective one of the FIP configuration information files. In this embodiment the libraries 25, 26, 27 are local components of the system, but in an alternative embodiment could be located remotely from the system and the FIP configuration information 10, associated executable FIP code 13 and associated decision condition information 7 for a FIP implementation downloaded as required.

Run-Time Monitoring

The run-time manager 18 further includes a run-time monitor 28, which, during execution of the FIP code 13, obtains run-time statistics 29 relating to the operation of the re-programmable hardware 22, that is, statistics relating to the run-time conditions, such as the number of times each procedure is called, the most frequently used opcodes, and the value of the program counter (PC) to determine execution locality.

In this embodiment the run-time monitor 28 collects run-time data and generates the run-time statistics 29 concurrently with execution of the FIP in order not to impact on the performance of the FIP. Such collection and analysis can be implemented in hardware, such as by an ASIC, or software, such as on a personal computer or a programmable system-on-chip device [4].

The frequency with which such statistics are collected is pertinent [26]. A short sampling period may yield results that do not accurately reflect representative characteristics of an application, whereas a long sampling period may require large amounts of storage space and have an impact on the execution of the application.

The FIP templates utilised in the design of the FIP implementations of the present invention allow for the ready incorporation of statistic monitoring modules. In this embodiment information is collected on the frequency of procedure calls, sampled over the run time of the given application.
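A minimal software sketch of collecting procedure-call frequencies is given below. It assumes the monitor is notified of each call by name; the class `RunTimeMonitorSketch` and its methods are hypothetical, and a hardware monitoring module would of course be structured differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical software model of procedure-call frequency collection. */
final class RunTimeMonitorSketch {
    private final Map<String, Long> calls = new HashMap<>();
    private long total;

    /** Records one call of the named procedure. */
    void recordCall(String procedure) {
        calls.merge(procedure, 1L, Long::sum);
        total++;
    }

    /** Fraction of all recorded calls attributable to the procedure (0 if none recorded). */
    double share(String procedure) {
        long n = calls.getOrDefault(procedure, 0L);
        return total == 0 ? 0.0 : (double) n / total;
    }

    /** Name of the most frequently called procedure, or null if nothing was recorded. */
    String hottest() {
        String best = null;
        long bestCount = -1;
        for (Map.Entry<String, Long> e : calls.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

The `share` value corresponds to the kind of statistic used later in the description, where a single procedure accounts for most of the calls.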

Optimisation Determination I

The run-time manager 18 further includes an optimisation instructor 30 for instructing an optimisation determiner 31. The optimisation instructor 30 can be configured to actuate the optimisation determiner 31 automatically, such as in response to an instruction in the executable code, or in response to an actuation instruction 32 from an external agent, such as a user.

The optimisation determiner 31 receives the run-time statistics 29 from the run-time monitor 28, and is operable to instruct the FIP loader controller 19 to signal the FIP loader 20 to load the FIP configuration information 10 and associated executable FIP code 13 for a new FIP implementation into the re-programmable hardware 22.

In one mode of operation, where the optimisation instructor 30 is actuated by an actuation instruction 32 from an external agent and the actuation instruction 32 identifies which FIP implementation is to be implemented, the optimisation determiner 31 instructs the FIP loader controller 19 directly to signal the FIP loader 20 to load the FIP configuration information 10 and associated executable FIP code 13 for a new FIP implementation into the re-programmable hardware 22. Typically, in one embodiment, the FIP configuration information 10, associated executable FIP code 13, and, where relevant, associated decision condition information 7 for one or more new FIP implementations could be loaded into the libraries 25, 26, 27 in anticipation of performing a new custom application, and the optimisation instructor 30 instructed by an external agent to load one of the new FIP implementations.

The optimisation determiner 31 is further configured to determine whether a different FIP implementation would provide for improved run-time performance of the implemented application under the current run-time conditions, these conditions being represented by the run-time statistics 29, such as to enable a new, optimised FIP implementation to be loaded into the re-programmable hardware 22.

In another mode of operation, where the libraries 25, 26 already contain FIP configuration information 10 and associated executable FIP code 13 for FIP implementations of the given application under various run-time conditions, the optimisation determiner 31 simply profiles the decision condition information 7 for those FIP implementations, as available via the run-time manager 18 and the FIP loader 20, and determines whether the decision condition information 7 for any of those FIP implementations more closely fits the obtained run-time statistics 29. Where the decision condition information 7 for one of those FIP implementations more closely fits the obtained run-time statistics 29, the optimisation determiner 31 instructs the FIP loader controller 19 to signal the FIP loader 20 to load the FIP configuration information 10 and associated executable FIP code 13 for that FIP implementation from the libraries 25, 26 into the re-programmable hardware 22. As no compilation of source code or instantiation and building of architecture information is required, and as the FIP implementation was developed so as to provide optimised performance for similar run-time conditions, the implementation of the new FIP is essentially instantaneous and no determination as to the benefit of re-programming the re-programmable hardware 22 has to be performed.
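One possible reading of "more closely fits" is a nearest-profile search, sketched below. The description does not specify how decision condition information is represented or compared, so the numeric profile vectors, the squared Euclidean distance and the class name `ClosestFitSketch` are all assumptions made for illustration.

```java
import java.util.Map;

/** Hypothetical nearest-profile selection among stored FIP implementations. */
final class ClosestFitSketch {
    /** Squared Euclidean distance between two equal-length profiles. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the implementation whose decision conditions best match the statistics. */
    static String select(Map<String, double[]> decisionConditions, double[] statistics) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : decisionConditions.entrySet()) {
            double d = distance(e.getValue(), statistics);
            if (d < bestDistance) {
                best = e.getKey();
                bestDistance = d;
            }
        }
        return best;
    }
}
```

Here each profile might hold, for example, the expected shares of encryption and decryption calls; the selected identifier would then be passed to the loader.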

In a further mode of operation, the optimisation determiner 31 is configured to instruct a FIP adapter 33 to generate a new FIP implementation which is better optimised to the run-time conditions for the given application.

In a yet further mode of operation, it is possible to optimise away the fetch and decode stage altogether, leaving only the data path, thereby effectively giving a direct hardware implementation, akin to the hardware implementation described in relation to FIG. 7. With this configuration, problems associated with memory bottlenecks can be obviated.

FIP Adaptation

In one embodiment, as illustrated in FIG. 9, the FIP adapter 33 comprises the components of the above-described FIP design system, as illustrated in FIG. 1, with the run-time statistics 29 being provided as application information to the FIP profiler 2 to enable optimised profiling of the customisation specification 1. In order to avoid unnecessary duplication of description, reference is made to the earlier description of the FIP design system.

In another embodiment, as illustrated in FIG. 10, the FIP adapter 33 comprises components of the above-described FIP design system, as illustrated in FIG. 1. In order to avoid unnecessary duplication of description, only the differences between the FIP adapter 33 and the above-described FIP design system will be described in detail, and reference is made to the earlier description of the FIP design system. The FIP adapter 33 differs from the above-described FIP design system in that the FIP profiler 2, the FIP library 3 and the FIP template generator 4 are omitted, and in that the run-time statistics 29 are provided as application information to the FIP analyser 5 to enable the identification of instruction customisations for the implemented FIP. The FIP adapter 33 of this embodiment represents a simplified version of the FIP adapter 33 of the above-described embodiment, and whilst not being as versatile, offers the advantage of providing for faster adaptation of FIP implementations.

The strategy for adaptation can be influenced by a number of factors, for example, depending on how fast environmental conditions change. Adaptation can be on-line or off-line. On-line adaptation encompasses adapting a FIP implementation for immediate re-configuration and loading the resulting FIP configuration information 10, associated executable FIP code 13, and, where relevant, the decision condition information 7 in the libraries 25, 26, 27, thereby enabling re-programming of the re-programmable hardware 22 in response to quite rapidly-changing environmental conditions. In this system, the FIP adapter 33 would usually be situated close to the re-programmable hardware 22. Off-line adaptation encompasses adapting a FIP implementation for less immediate re-programming of the re-programmable hardware 22. Such adaptation would typically be used where the environmental conditions remain stable or unchanged for relatively long periods of time. In this system, the FIP adapter 33 could be located remotely from the re-programmable hardware 22.

As for the above-described FIP design system, the FIP analyser 5 of the FIP adapter analyses the run-time statistics 29, typically the frequency of use of certain native or custom instructions, and determines possible optimisations, typically in determining the optimal trade-off for a given situation, for instance, in determining the smallest available hardware area at a given speed of execution. Based on the run-time statistics 29, more resources can be dedicated to functions that are used most frequently, for example, by the creation of custom instructions for certain of those functions, or various other optimisations can be performed, such as using faster operations for more frequently-used instructions, or changing instruction cache size or stack depth. By way of example, the performance of a frequently-used multiplier circuit could be increased, while reducing the area and performance of the less frequently-used operations. The present invention provides a level of flexibility in the optimisation analysis because of the domain specificity.

Custom instructions are created according to the results of the above analysis. Once custom instructions have been generated, the FIP analyser 5 analyses the resulting instructions. This analysis is necessary because, for example, the customisation request may contain too many new custom instructions, and may not satisfy area or latency constraints. In that case, the FIP analyser 5 determines whether to remove custom instructions, reduce the number of native opcodes supported, or downgrade the performance of the less-frequently used opcodes.

One way of implementing new custom instructions on-line during run time is by using look-up tables. For example, the libraries 25, 26, 27 can include pre-compiled FIPs with custom instructions that take 1, 2 or 3 inputs and give an output. At run time, using appropriate tools such as the “JBits” and “JRTR” tools from Xilinx Inc. [27], the relevant look-up table codes can be uploaded relatively efficiently using partial run-time re-configuration. In one embodiment this approach would utilise information that certain styles of functions are used more frequently, but not necessarily precisely which function. For example, different trigonometric functions can be supported by re-configuring the look-up tables.

Particular optimisations performed by the FIP adapter 33 in the generation of custom instructions include opcode chaining, instruction folding and resource replication. These optimisations will be described in more detail hereinbelow.

Opcode Chaining

The concept of opcode chaining is to connect the sequence of opcodes that make up a procedure. This is similar to the concept of microcode in a RISC machine. Chaining reduces the time spent on fetching and decoding instructions. Further, by converting a procedure call to a single instruction, the overhead of calling a procedure can be avoided; such overheads include pre-amble and post-amble housekeeping routines, such as storing program counters, shared registers and refilling pre-fetch buffers.

Instruction Folding

Instruction folding allows several opcodes to be executed in parallel. For instance, up to four Java opcodes can be processed concurrently [28]. By way of example, in stack-based machines, addition requires two values to be pushed on the stack, and the results may have to be stored after the instruction. A register-based processor can perform all four operations in one cycle if the values have been loaded into the register file. Since operations are chained directly together, there is no need to load values into the register file. The values for an operation would have been loaded as a result of a previous operation.

Resource Replication

Replication of resources enables the utilisation of some level of parallelism which may have previously been hidden. There are two possible approaches, these being data level parallelism and instruction level parallelism [29]. Data level parallelism exploits data independence by performing as many operations as possible in one instruction. Instruction level parallelism involves concurrent operations of independent instructions which are processed by different execution units. When creating custom instructions, the available resources are expanded as needed to exploit as much data level or instruction level parallelism as possible.

There are other optimisations which can be employed. These include device-specific optimisations, such as using look-up table implementations. These optimisations exploit high block RAM content, as in the Xilinx Virtex-E (Xilinx Inc.). Also, by moving registers around and removing redundant registers, the overall cycle count of a custom instruction can be reduced. Instruction processors have rigid timing characteristics. The clock period of an instruction processor is determined by the critical delay path, and this means that simple instructions, such as bit manipulation, bit-wise logic and load/store operations, will take at least one cycle to execute. Other possible optimisations relate to reducing the overheads of run-time re-configuration by reducing the amount of configuration storage required to store multiple FIP designs and the time taken to re-configure between these FIP designs [16].

In this embodiment the most frequently used opcodes and groups of opcodes are identified as candidates for optimisation. A large range of possible candidates is thus likely to be identified for creating custom instructions.

In one embodiment, such as for on-line candidate selection, where the selection of candidates has to be made rapidly, a simple decision tree is utilised, typically selecting a predetermined number of the candidate custom instructions having the most frequent usage patterns, while remaining within some size constraint. This technique allows for rapid selection, but may not lead to an optimal result.
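The rapid selection described above could, for instance, be approximated by a greedy pass over the candidates in order of decreasing usage. The sketch below is not the decision tree of the embodiment; it is a simplified stand-in, and the `Candidate` fields, the area units and the class name `GreedySelectionSketch` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical greedy selection of candidate custom instructions. */
final class GreedySelectionSketch {
    static final class Candidate {
        final String name;
        final long frequency; // observed usage count
        final int area;       // estimated size in arbitrary units

        Candidate(String name, long frequency, int area) {
            this.name = name;
            this.frequency = frequency;
            this.area = area;
        }
    }

    /** Picks up to maxCount of the most frequently used candidates that fit the area budget. */
    static List<String> select(List<Candidate> candidates, int maxCount, int areaBudget) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingLong((Candidate c) -> c.frequency).reversed());
        List<String> chosen = new ArrayList<>();
        int used = 0;
        for (Candidate c : sorted) {
            if (chosen.size() == maxCount) {
                break;
            }
            if (used + c.area <= areaBudget) { // skip candidates that would exceed the budget
                chosen.add(c.name);
                used += c.area;
            }
        }
        return chosen;
    }
}
```

Because a large, frequently used candidate can crowd out two smaller ones with a greater combined benefit, this kind of selection is fast but not guaranteed to be optimal.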

In another embodiment, and particularly suited to off-line candidate selection, selection of optimisation candidates is determined by utilising an objective decision function D. This decision is subject to constraints, such as the available area and power, the impact on the current FIP configuration and custom instructions.

In one embodiment the decision function D can take the form:

$\begin{matrix}{D = {\max {\sum\limits_{j = 1}^{n}{\frac{T_{sw}C_{{sw},j}F_{j}}{T_{ci}C_{{ci},j}F_{j}}S_{j}}}}} & (1)\end{matrix}$

Where:

-   C_(sw,j) is the number of clock cycles to implement a software function ƒ( ).
-   T_(sw) is the cycle time for each clock cycle in the clock cycle number C_(sw,j).
-   C_(ci,j) is the number of clock cycles to implement a custom instruction.
-   T_(ci) is the cycle time for each clock cycle in the clock cycle number C_(ci,j).
-   F_(j) is the number of times a procedure is called.
-   S_(j) is a binary selection variable, denoting whether the custom instruction is implemented.
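A small sketch of evaluating equation (1) by exhaustive search over the selection variables S_(j) is given below. The area figures and the single area budget are assumptions standing in for the constraints mentioned in the text, and the class name `DecisionFunctionSketch` is hypothetical. Note that, in the form given in equation (1), F_(j) appears in both numerator and denominator of each term.

```java
/** Hypothetical exhaustive maximisation of the decision function D of equation (1). */
final class DecisionFunctionSketch {
    /** Term j of equation (1): (T_sw C_sw,j F_j) / (T_ci C_ci,j F_j); f must be non-zero. */
    static double term(double tSw, double cSw, double tCi, double cCi, double f) {
        return (tSw * cSw * f) / (tCi * cCi * f);
    }

    /** Returns the selection bitmask (bit j set means S_j = 1) maximising D within the budget. */
    static int bestSelection(double[] terms, int[] area, int areaBudget) {
        int best = 0;
        double bestD = 0.0;
        for (int mask = 0; mask < (1 << terms.length); mask++) {
            double d = 0.0;
            int used = 0;
            for (int j = 0; j < terms.length; j++) {
                if ((mask & (1 << j)) != 0) {
                    d += terms[j];
                    used += area[j];
                }
            }
            if (used <= areaBudget && d > bestD) {
                best = mask;
                bestD = d;
            }
        }
        return best;
    }
}
```

Exhaustive search grows as 2^n and is only plausible for a handful of candidates, which fits the off-line setting; a larger candidate set would call for a knapsack-style method.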

Based on the size and performance estimations as provided by the FIP instantiator 6 for the candidate custom instructions, the FIP analyser 5 approves ones of the custom instructions and the FIP adapter 33 proceeds to create the FIP configuration information 10, associated executable FIP code 13, and, where relevant, associated decision condition information 7 for a new FIP implementation, which FIP configuration information 10, associated executable FIP code 13, and, where relevant, associated decision condition information 7 are loaded into the libraries 25, 26, 27. The new FIP implementation, as provided by the FIP configuration information 10, associated executable FIP code 13 and, where relevant, associated decision condition information 7, can then be loaded into the re-programmable hardware 22 when the application is next executed.

Optimisation Determination II (Re-Programming)

Re-programming of the re-programmable hardware 22 can occur as a result of an explicit instruction in the executable FIP code 13, or an actuation instruction 32 to the system at run time, such as by the user pressing a button or keying in re-configuration instructions.

Where the re-programming is dynamic re-programming at run time, if execution data is available prior to the application executing, re-programming may be scheduled at run time. Otherwise, if the re-programmable hardware 22 hosting the FIP is getting full, a scheme such as the least recently used (LRU) method may be used to decide which custom instructions are to remain implemented. This can be weighted by information from the run-time statistics 29 obtained by the run-time monitor 28, so that the more frequently used custom instructions will be least likely to be swapped out.
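One way of weighting an LRU scheme by usage frequency is sketched below. The scoring formula, age divided by one plus the use count, is purely an assumption chosen for illustration; the description does not prescribe a particular weighting, and the class name `WeightedLruSketch` is hypothetical.

```java
/** Hypothetical frequency-weighted LRU choice of the custom instruction to swap out. */
final class WeightedLruSketch {
    /**
     * Returns the index of the custom instruction to evict, or -1 if there are none.
     * Older entries score higher; frequently used entries score lower.
     */
    static int victim(long now, long[] lastUsed, long[] uses) {
        int victim = -1;
        double worst = -1.0;
        for (int i = 0; i < lastUsed.length; i++) {
            double score = (double) (now - lastUsed[i]) / (1 + uses[i]);
            if (score > worst) {
                worst = score;
                victim = i;
            }
        }
        return victim;
    }
}
```

With this weighting, an instruction that was last used long ago but has been used very often can outlive a more recently used but rarely used one, which plain LRU would not allow.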

In this embodiment a simple metric is used to determine whether re-configuration of the re-programmable hardware 22 with a new FIP implementation is beneficial. Where the FIP which is currently operating executes at a particular speed, and a new, faster FIP is proposed as a replacement, the new FIP should only be adopted if the reduction in run time is greater than the re-configuration time.

Consider a software function ƒ( ), which, when implemented by a ‘normal’ instruction, requires C_(sw) clock cycles for execution, each having a cycle time T_(sw), and when implemented as a custom instruction, requires C_(ci) clock cycles for execution, each having a cycle time T_(ci). Where the software function ƒ( ) is called F times over the time period under investigation, in this embodiment one execution of the application, and the re-configuration time for the re-programmable hardware 22, which includes the time for collecting and analysing data, is T_(r), the execution times for executing the software function t_(sw) and the custom instruction t_(ci) can be given as:

t_(sw)=C_(sw)T_(sw)F  (2)

t_(ci)=C_(ci)T_(ci)F  (3)

A re-configuration ratio R can be defined as follows:

$\begin{matrix}{R = \frac{t_{sw}}{t_{ci} + T_{r}}} & (4) \\{{Thus}\text{:}} & \; \\{R = \frac{C_{sw}T_{sw}F}{{C_{ci}T_{ci}F} + T_{r}}} & (5)\end{matrix}$

More generally, with n custom instructions, the re-configuration ratio R becomes:

$\begin{matrix}{R = \frac{T_{sw}{\sum\limits_{j = 1}^{n}{C_{{sw},j}F_{j}}}}{{T_{ci}{\sum\limits_{j = 1}^{n}\left( {C_{{ci},j}F_{j}} \right)}} + T_{r}}} & (6)\end{matrix}$
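Equation (6) translates directly into a short routine. The sketch below assumes all times are expressed in the same unit and that the arrays are of equal length; the class and method names are hypothetical.

```java
/** Hypothetical evaluation of the re-configuration ratio R of equation (6). */
final class ReconfigurationRatioSketch {
    /**
     * tSw, tCi: cycle times; cSw[j], cCi[j]: cycle counts for function j in software
     * and as a custom instruction; f[j]: call counts; tR: re-configuration time.
     */
    static double ratio(double tSw, double[] cSw, double tCi, double[] cCi,
                        double[] f, double tR) {
        double swCycles = 0.0;
        double ciCycles = 0.0;
        for (int j = 0; j < f.length; j++) {
            swCycles += cSw[j] * f[j];
            ciCycles += cCi[j] * f[j];
        }
        return (tSw * swCycles) / (tCi * ciCycles + tR);
    }
}
```

For illustration only, with a 10 ns cycle time for both FIPs, 30 cycles in software, 6 cycles as a custom instruction and an assumed re-configuration time of 120,000 ns, the ratio is exactly 1 at 500 calls and exceeds 1 beyond that.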

The re-configuration threshold R_(T) is reached when R=1. For re-configuration to be beneficial, the re-configuration ratio R has to exceed the re-configuration threshold R_(T). That is, the time taken t_(sw) to execute the FIP code in software, as represented by the numerator of the re-configuration fraction, has to be greater than the time taken t_(ci) to execute the FIP code, which includes custom instructions, for the re-configured FIP plus the re-configuration time T_(r).
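For the single-instruction case of equation (5), the break-even number of calls can be made explicit. Assuming the custom instruction is actually faster, that is, C_(sw)T_(sw) > C_(ci)T_(ci), the condition R > 1 rearranges as:

$R > 1 \iff C_{sw}T_{sw}F > C_{ci}T_{ci}F + T_{r} \iff F > \frac{T_{r}}{C_{sw}T_{sw} - C_{ci}T_{ci}}$

The larger the saving per call, or the smaller the re-configuration time T_(r), the fewer calls are needed before re-configuration pays off.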

FIG. 11 graphically illustrates the effect of varying different parameters on the re-configuration ratio R. The horizontal axis represents the number of times F an application is executed. The vertical axis represents the re-configuration ratio R.

The lowermost curve, Curve A, represents a base FIP, where C_(sw)T_(sw)=C_(ci)T_(ci). The re-configuration ratio R for the base FIP will never exceed the re-configuration threshold R_(T), as the re-configuration time T_(r) would have to be less than or equal to zero.

Curves B and C represent the base FIP where re-configured to incorporate one and two custom instructions, respectively. The general form of the re-configuration ratio R, as given in equation (6), shows that as more custom instructions are included, the re-configuration threshold R_(T) can be reached with fewer executions of the application. As more custom instructions are added and generic instructions are removed, the shape of the re-configuration curve will tend towards that of a direct hardware implementation.

Curve D represents the re-configuration ratio R for a FIP incorporating two custom instructions, but operating at half the clock speed of the base FIP, that is, where T_(ci)=2T_(sw).

Curve E represents a FIP with two custom instructions and half the re-configuration time T_(r) of the base FIP. Reducing the re-configuration time T_(r) by half increases the initial gradient of the re-configuration curve and reduces the number of application executions required to reach the re-configuration threshold R_(T). Full re-configuration has been employed in the other exemplified FIPs, but partial re-configuration can be employed. The re-configuration time T_(r) can be re-written as the product of the re-configuration cycle time t_(r) and the number of re-configuration cycles n_(r) required to re-configure the re-programmable hardware 22. By utilising partial re-configuration, the number of re-configuration cycles n_(r) required can be reduced [26, 30], hence reducing its effect on the re-configuration ratio R. The number of re-configuration cycles n_(r) may also be reduced through improvements in technology and architectures that support fast re-configuration through caches or context switches [6, 31, 32].

Implementation

Operation of the run-time management system of the present invention will now be described hereinbelow by way of example with reference to an implementation of the advanced encryption standard (AES-Rijndael) algorithm [17] for the encryption and decryption of information. The AES algorithm is an iterated block cipher with variable block and key length. In this implementation the FIPs are assumed to run at 100 MHz. Also, data collection and analysis is conducted in parallel with the FIP execution, and thus does not introduce any performance penalty.

In the AES implementation, the most frequently executed procedure is the procedure FFmulx, a procedure defined by the AES standard. Of all procedure calls, 74% can be attributed to FFmulx.

The Java implementation of the FFmulx procedure is given hereinbelow.

    static final byte m_poly = 0x1b;

    public byte FFmulx(byte a) {
        return (byte) ((a << 1) ^ ((a & 0x80) != 0 ? m_poly : 0));
    }
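A self-contained variant that can be compiled and checked against known values is sketched below. It assumes the reduction constant 0x1b used by the AES standard for multiplication by x in GF(2^8); the class name `FFmulxSketch` and the static form of the method are choices made here for illustration.

```java
/** Hypothetical stand-alone version of FFmulx: multiplication by x in GF(2^8). */
final class FFmulxSketch {
    // Low byte of the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (assumed here).
    static final byte M_POLY = 0x1b;

    /** Shifts left by one bit and reduces by M_POLY if the top bit of the input was set. */
    static byte ffmulx(byte a) {
        return (byte) ((a << 1) ^ ((a & 0x80) != 0 ? M_POLY : 0));
    }
}
```

Starting from the byte 0x57, repeated application yields 0xae, 0x47, 0x8e and 0x07, which matches the worked multiplication example in the AES standard.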

FIG. 12 illustrates the sequential implementation of the procedure FFmulx in Java opcodes, and chained and folded opcodes as an optimisation of the base opcodes.

The left-hand column represents the Java opcodes required to implement the procedure FFmulx. In this embodiment the Java opcode implementation takes 26.5 clock cycles on average, plus an additional 4 clock cycles for procedure pre-amble and post-amble routines. Depending on the outcome of the conditional branch IFEQ opcode, this implementation takes from 25 to 28 clock cycles to execute.

The right-hand column represents the result of optimisation by both opcode chaining and instruction folding. Opcode chaining involves storing intermediate results in temporary registers. By removing the need to push and pop values from the stack, the sequential structure imposed by the stack is eliminated. Next, instruction folding is applied. Instruction folding allows several opcodes to be combined or folded into one instruction. In this way, several stack-based instructions are converted into one register-based instruction. Furthermore, since the procedure FFmulx is replaced by a single instruction, there is no longer a need to perform the pre-amble and post-amble routines which are necessary for procedural calls. This optimisation reduces the number of clock cycles in each application execution from about 30 clock cycles to 8.5 clock cycles on average.

FIG. 13 represents a further optimisation of the procedure FFmulx. In this optimisation, custom instructions are executed in parallel by exploiting data dependence. In FIG. 13, instructions on the same level are executed in the same clock cycle, with the arrows denoting the data dependency. This implementation follows ideas in VLIW/EPIC architectures [33], such as multiple issue and predicated execution, and resembles direct hardware implementation. With this optimisation, the cycle count is reduced to 6 cycles.

Using the above optimisations, the original software function for the procedure FFmulx has been optimised from 30 cycles to 6 cycles, producing a five-fold speed-up.
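The cycle figures quoted above fit together as follows, taking the average opcode cost plus the call overhead as the starting point (speed-ups rounded):

$26.5 + 4 = 30.5 \approx 30, \qquad \frac{30}{8.5} \approx 3.5, \qquad \frac{30}{6} = 5$

That is, chaining and folding alone give roughly a 3.5-fold improvement, and the parallel version brings the total to the stated five-fold speed-up.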

FIG. 14 illustrates a graph of the re-configuration ratio R against the number of 128-bit data blocks encrypted by FIPs implementing parallel custom instructions with different key widths. Re-configuration calculations are based on the time for full re-configuration of the FIP implementation on a Xilinx XCV 1000 chip (Xilinx Inc.).

Recall from equation (6) that re-configuration of the re-programmable hardware 22 is beneficial where the re-configuration ratio R is greater than one. From FIG. 14, it can be seen that, with a 128-bit encryption key, about 650 blocks of 128-bit data would have to be processed before re-configuration would be beneficial. This translates to about 10 Kbytes of data. It will be seen that re-configuration becomes progressively more beneficial as the size of the encryption key increases. With a 192-bit key, about 8.6 Kbytes would have to be processed before re-configuration became beneficial. And, with a 256-bit key, about 7 Kbytes would have to be processed before re-configuration became beneficial.
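The conversion from blocks to bytes for the 128-bit key case is simply:

$650 \times \frac{128}{8}\ \text{bytes} = 10\,400\ \text{bytes} \approx 10\ \text{Kbytes}$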

The AES specification [17] suggests that the AES algorithm could be accelerated by unrolling several of the AES functions into look-up-tables. Speeds of up to 7 Gbits/s have been reported [34] using block RAMs in Xilinx Virtex-E chips (Xilinx Inc.) for such purposes. Custom instructions designed for FIPs can also make use of such techniques.

In this regard, various FIP implementations have been developed to support the AES algorithm. FIG. 15 graphically illustrates the relative performance of these FIP implementations, where the key size and block size are 256 bits. FIG. 16 illustrates the speed-ups corresponding to these FIP implementations.

AES1 is a first FIP implementation of the AES algorithm which has been customised by removing hardware associated with unused opcodes in the generic or base FIP, but does not contain any custom instructions.

AES2 is a second FIP implementation of the AES algorithm which incorporates three custom instructions, these being the FFmulx customisation described hereinabove and two additional custom instructions. These three custom instructions speed up both encryption and decryption. The improvement is 1.3 times for encryption and 3.6 times for decryption. The new custom instructions replace the functionality of some opcodes, with the opcodes which are no longer used being removed to provide more area for the custom instructions. Thus, the trade-off is that AES2 is less flexible than AES1, since some routines executable in AES1 may no longer be executable on AES2.

AES3 is a third FIP implementation of the AES algorithm which provides for further refinement in AES encryption. AES3 incorporates a new custom instruction which replaces the inner loop for encryption. More resource is given to the custom instruction that speeds up encryption by utilising look-up-tables. As a result, the improvement in encryption performance is 5.2 times as compared to AES1, whereas the improvement in decryption performance is only 1.4 times. The trade-off, however, is that the two additional custom instructions introduced in AES2 have to be removed to make space for this new custom instruction. So, while the encryption speed is improved, this is at the expense of the decryption speed.

AES4 is a fourth FIP implementation of the AES algorithm which provides for refinement of AES decryption. AES4 incorporates a new custom instruction which provides a five-fold decryption speed-up over AES2, but with similar trade-offs as for AES3.

These results suggest a strategy for re-configuration. Where encryption is used more often than decryption, AES3 should be employed. On the other hand, where decryption is used more often, AES4 should be employed. Where no information about usage is available, AES2 should be employed. Similar optimisation strategies can be applied to applications where the run-time conditions change and favour different FIP implementations at different times.
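The strategy above amounts to a simple rule on the observed usage counts, which might be sketched as follows. The class name `AesStrategySketch` is hypothetical, and falling back to AES2 when encryption and decryption are used equally often is an assumption, since the text does not address that case.

```java
/** Hypothetical selection among the AES2, AES3 and AES4 FIP implementations. */
final class AesStrategySketch {
    /** Chooses an implementation from observed encryption and decryption counts. */
    static String choose(long encryptions, long decryptions) {
        if (encryptions == 0 && decryptions == 0) {
            return "AES2"; // no usage information available
        }
        if (encryptions > decryptions) {
            return "AES3"; // encryption-optimised
        }
        if (decryptions > encryptions) {
            return "AES4"; // decryption-optimised
        }
        return "AES2";     // equal usage: assumed fallback, not stated in the text
    }
}
```

In a complete system the outcome of such a rule would still be subject to the re-configuration ratio check, so that a switch is made only when it is expected to pay off.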

For implementation of the AES algorithm, initially, a generic FIP, such as a JVM, is used to execute the AES algorithm. At design time, the FIP designer can introduce custom instructions to accelerate the execution of the AES algorithm. After deploying the system, the run-time monitor 28 in the run-time manager 18 would record the execution patterns of the user. Consider that the run-time monitor 28 shows that AES decryption is used more frequently and on larger block sizes than AES encryption. Such a situation would arise, for example, when a user downloads banking information for browsing, and sends back relatively smaller size data for transactions. The optimisation instructor 30 would consequently request optimisation of the FIP implementation. Custom instructions would be created by the FIP adapter 33 and the FIP implementation AES4 would be created. The run-time manager 18 would then determine whether it would be advantageous to re-configure to the new FIP implementation.

III—Tools

Debug Tools

The debug tool provides a way for users to trace through code in the event of a crash, during simulation or execution. After several adaptations, various FIP configurations could be in use during the execution of an application. The debug tool assists in identifying errors by tracing through the correct FIP design, and provides the ability to expand into custom instructions, revealing the original opcodes that are used to create the custom instruction. FIG. 17 diagrammatically illustrates the debug tool. The left-hand box, box A, contains the original code. The central box, box B, contains the new code, after adaptation. The right-hand box, box C, shows the FIP configuration information 10 and the associated executable FIP code 13. During debug, a user needs to know which FIP is running, and also what opcodes are used to create a custom instruction, such as codeA.
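The expansion of custom instructions during debugging might be sketched as follows. This is a minimal illustration under stated assumptions, not the debug tool of FIG. 17: the table `CUSTOM_INSTRUCTIONS`, the function `expand_trace`, and the particular JVM-style opcodes listed for codeA are hypothetical, since the opcodes making up codeA are not given here.

```python
from typing import Dict, List

# Hypothetical table for one FIP implementation: each custom instruction
# maps to the original opcode sequence it was created from.
CUSTOM_INSTRUCTIONS: Dict[str, List[str]] = {
    "codeA": ["iload_1", "iload_2", "iadd", "istore_3"],  # assumed opcodes
}


def expand_trace(trace: List[str],
                 table: Dict[str, List[str]] = CUSTOM_INSTRUCTIONS) -> List[str]:
    """Replace every custom instruction in a trace by its original opcodes."""
    expanded: List[str] = []
    for op in trace:
        # Ordinary opcodes pass through unchanged; custom ones are expanded.
        expanded.extend(table.get(op, [op]))
    return expanded
```

In this sketch a separate table would be kept for each FIP implementation, so that the trace is expanded against the FIP that was actually running.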

Compiler Tools

FIG. 18 illustrates the compiler tool. The compiler tool allows the user to enter application code and compile the application code into machine code. The compiled code can then be profiled and inspected so that a FIP can be optimised to execute the application. The compiler tool can suggest custom instructions to implement or allow the user to create custom instructions. The right-hand pop-up box illustrates this feature, whereby a user is allowed to create a new custom instruction and a new custom instruction is also proposed by the compiler tool, this being the custom instruction nextNum.

When a FIP design is acceptable, FIP configuration information 10 and the associated executable FIP code 13 are generated for the FIP implementation.

As mentioned earlier, the compiler tool can be used to statically determine re-configuration. In this embodiment the compiler tool also provides a means for the user to specify their re-configuration ratio R and the re-configuration threshold R_(T) at which re-configuration will be attempted. The compiler tool also allows the user to tune the run-time environment in terms of pre-fetch and re-configuration strategies.

The compiler tool also provides a means for users to provide information which will act as indications in optimising the FIP. By way of example, specification criteria include: (i) load balancing; for example, where a user knows that an adder will be used 60% of the time and a multiplier used only 5% of the time, more resources should be dedicated to the adder, to increase the speed of execution of the FIP; (ii) throughput; that is, the results produced per unit time; (iii) the size of the executable FIP code; and (iv) the size of FIP configuration.
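The load-balancing criterion (i) can be illustrated by a simple proportional split of a resource budget. This is a sketch under assumptions and not the allocation method of the compiler tool: the function `allocate_resources`, the strictly proportional rule, the abstract resource budget, and the "other" category making up the remaining 35% of usage are all hypothetical.

```python
from typing import Dict


def allocate_resources(usage_percent: Dict[str, int], budget: int) -> Dict[str, int]:
    """Split a hardware resource budget in proportion to expected usage.

    usage_percent maps each functional unit to its expected share of
    execution time; budget is an abstract count of resource units.
    """
    total = sum(usage_percent.values())
    if total <= 0:
        raise ValueError("usage percentages must sum to a positive value")
    # Integer proportional share for every unit (rounded down).
    shares = {unit: budget * pct // total for unit, pct in usage_percent.items()}
    # Hand any units lost to rounding to the most heavily used units.
    leftover = budget - sum(shares.values())
    by_usage = sorted(usage_percent, key=usage_percent.get, reverse=True)
    for unit in by_usage[:leftover]:
        shares[unit] += 1
    return shares


# Adder used 60% of the time, multiplier 5%; "other" is an assumed remainder.
shares = allocate_resources({"adder": 60, "multiplier": 5, "other": 35}, budget=1000)
# shares == {"adder": 600, "multiplier": 50, "other": 350}
```

A proportional rule is only one possible reading of "more resources should be dedicated to the adder"; a real tool would also have to respect the minimum resources each unit needs in order to function at all.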

Finally, it will be understood that the present invention has been described in its preferred embodiments and can be modified in many different ways without departing from the scope of the invention as defined by the appended claims.

Further, it is to be understood that the contents of all of the documents cited herein are incorporated by reference.

REFERENCES

-   [1] H. Styles and W. Luk. Customising graphics applications: techniques and programming interface. In Proc. IEEE Symp. on Field Programmable Custom Computing Machines. IEEE Computer Society Press, 2000.
-   [2] J. A. Fisher. Customized instruction sets for embedded processors. In Proc. 36th Design Automation Conference, pp. 253-257, 1999.
-   [3] Altera Corporation. Excalibur Embedded Processor Solutions. http://www.altera.com/html/products/excalibursplash.html
-   [4] Triscend. The Configurable System on a Chip. http://www.triscend.com/products/index.html.
-   [5] Xilinx. IBM and Xilinx team to create new generation of integrated circuits. http://www.xilinx.com/prs_rls/ibmpartner.htm.
-   [6] U.S. Pat. No. 5,752,035
-   [7] J. Gray. Building a RISC system in an FPGA. In Circuit Cellar: The magazine for computer applications, pp. 20-27, March 2000.
-   [8] M. J. Wirthlin and K. L. Gilson. The nano processor: a low resource reconfigurable processor. In Proc. IEEE Symp. on Field Programmable Custom Computing Machines, pp. 23-30. IEEE Computer Society Press, 1994.
-   [9] A. Donlin. Self-modifying circuitry - a platform for tractable virtual circuitry. In Field Programmable Logic and Applications, LNCS 1482, pp. 199-208. Springer, 1998.
-   [10] M. Wirthlin and B. Hutchings. A dynamic instruction set computer. In Proc. IEEE Symp. on Field Programmable Custom Computing Machines, pp. 99-107. IEEE Computer Society Press, 1995.
-   [11] I. Page. Automatic design and implementation of microprocessors. In Proc. WoTUG-17, pp. 190-204. IOS Press, 1994.
-   [12] C. Cladingboel. Hardware compilation and the Java abstract machine. M.Sc. Thesis, Oxford University Computing Laboratory, 1997.
-   [13] C. J. G. North. Graph reduction in hardware. M.Sc. Thesis, Oxford University Computing Laboratory, 1992.
-   [14] R. Watts. A parameterised ARM processor. Technical Report, Oxford University Computing Laboratory, 1993.
-   [15] WO-A-00/46704
-   [16] N. Shirazi, W. Luk and P. Y. K. Cheung. Framework and tools for run-time reconfigurable designs. IEE Proc.-Comput. Digit. Tech., 147(3), pp. 147-152, May 2000.
-   [17] National Institute of Standards and Technology. Advanced Encryption Standard. http://csrc.nist.gov/encryption/acs.
-   [18] Celoxica. Handel-C Production Information. http://www.celoxica.com.
-   [19] J. He, G. Brown, W. Luk and J. O'Leary. Deriving two-phase modules for a multi-target hardware compiler. In Proc. 3rd Workshop on Designing Correct Circuits. Springer Electronic Workshop in Computing Series, 1996, http://www.ewic.org.uk/ewic/workshop/view.cfm/DOC-96.
-   [20] W. Luk, J. Gray, D. Grant, S. Guo, S. McKeever, N. Shirazi, M. Dean, S. Seng and K. Teo. Reusing intellectual property with parameterised hardware libraries. In Advances in Information Technologies: The Business Challenge, pp. 788-795. IOS Press, 1997.
-   [21] Xilinx. Relationally Placed Macros. http://toolbox.xilinx.com/docsan/2_li/data/common/lib/lib2_(—)2.htm.
-   [22] Altera Corporation. Megafunctions. http://www.altera.com/html/mega/mega.html
-   [23] T. Lindholm and F. Yellin. The Java Virtual Machine Specification (2nd Ed.). Addison-Wesley, 1999.
-   [24] Sun Microsystems. PicoJava™ specification. http://www.sun.com/microelectronics/picoJava.
-   [25] Pendragon Software Corporation. CaffineMark 3.0 Java Benchmark. http://www.pendragon-software.com/pendragon/cm3/index.html.
-   [26] N. Shirazi, W. Luk and P. Y. K. Cheung. Run-time management of dynamically reconfigurable designs. In Field Programmable Logic and Applications, pp. 59-68. Springer, 1998.
-   [27] S. McMillan and S. A. Guccione. Partial run-time reconfiguration using JRTR. In Field Programmable Logic and Applications, LNCS 1896, pp. 352-360. Springer, 2000.
-   [28] H. McGhan and M. O'Connor. PicoJava: a direct execution engine for Java bytecode. IEEE Computer, pp. 22-30, October 1998.
-   [29] R. Espasa and M. Valero. Exploiting instruction and data level parallelism. IEEE Micro, pp. 20-27, September/October 1997.
-   [30] N. Shirazi, D. Benyamin, W. Luk, P. Y. K. Cheung and S. Guo. Quantitative analysis of FPGA-based database searching. Journal of VLSI Signal Processing, pp. 85-96, May/June 2001.
-   [31] S. Scalera and J. Vázquez. The design and implementation of a context switching FPGA. In Proc. IEEE Symp. on Field Programmable Custom Computing Machines. IEEE Computer Society Press, 1998.
-   [32] S. Trimberger, D. Carberry, and A. Johnson. A time-multiplexed FPGA. In Proc. IEEE Symp. on Field Programmable Custom Computing Machines, pp. 22-28. IEEE Computer Society Press, 1997.
-   [33] K. V. Palem, S. Talla, and P. W. Devaney. Adaptive explicitly parallel instruction computing. In Proc. 4th Australasian Computer Architecture Conf. Springer Verlag, 1999.
-   [34] M. McLoone and J. McCanny. Single-chip FPGA implementation of the Advanced Encryption Standard algorithm. In Field Programmable Logic and Applications. Springer, 2001.

1. A method of generating configuration information and associated executable code based on a customisation specification, which includes application information including application source code and customisation information including design constraints, for implementing an instruction processor using re-programmable hardware, the method comprising the steps of: generating a template for each processor style identified as a candidate for implementation, each template comprising a processor definition and associated parameters for implementing the respective processor style in the re-programmable hardware, and incorporating the customisation information from the customisation specification; analysing instruction information for each template and determining instruction optimisations; compiling the application source code to include the instruction optimisations and generate executable code; analysing architecture information for each template and determining architecture optimisations; generating first configuration information including the architecture optimisations; and generating second, device-specific configuration information from the first configuration information including the architecture optimisations.
2. The method of claim 1, further comprising the steps of: profiling the first configuration information and the executable code for each candidate implementation; and in response thereto selecting one or more optimal implementations based on predeterminable criteria.
3. The method of claim 1, wherein the customisation information further includes at least one custom instruction.
4. The method of claim 1, further comprising the steps of: profiling information in the customisation specification; and identifying at least one processor style as a candidate for implementation.
5. The method of claim 4, wherein profiling information for enabling optimisation is collected in the customisation specification profiling step.
6. The method of claim 5, wherein the instruction information analysis step comprises the steps of: utilising the profiling information in analysing the instruction information; and determining the instruction optimisations therefrom.
7. The method of claim 1, wherein the instruction information analysis step comprises the steps of: identifying candidate instruction optimisations; and determining implementation of the instruction optimisations based on estimations performed based on instantiation of the candidate instruction optimisations.
8. The method of claim 7, wherein, where the estimations provide that the re-programmable hardware cannot be programmed to implement all instructions together during run time, the instruction information analysis step comprises the step of: grouping combined ones of instructions into sets of instructions which can be implemented by re-programming of the re-programmable hardware.
9. The method of claim 1, wherein the instruction information analysis step comprises the steps of: determining a plurality of implementations for different run-time conditions, each having instruction optimisations associated with the run-time conditions; and generating decision condition information associated with each implementation, which decision condition information enables selection between the implementations depending on actual run-time conditions.
10. The method of claim 4, wherein, where the instruction optimisations cannot provide an implementation which complies with design constraints, the instruction information analysis step comprises the step of: invoking the customisation specification profiling step to re-profile the customisation specification based on analysis information provided by the instruction information analysis step.
11. The method of claim 1, wherein the architecture optimisations include pipelining.
12. The method of claim 1, wherein, where a plurality of configurations of the re-programmable hardware are required to implement the instruction processor, further comprising the steps of: optimising ones of the configurations into groups; and scheduling implementation of the grouped configurations.
13. The method of claim 1, wherein each template is generated from processor definitions and associated parameters extracted from a library containing processor definitions and associated parameters for a plurality of processor styles.
14. The method of claim 1, wherein a compiler utilised in compiling the application source code is generated in the instruction information analysis step, and the compiling step comprises the steps of: annotating the application source code with customisation information; and compiling the annotated source code to provide an optimised executable code.
15. The method of claim 1, wherein the compiling step comprises the steps of: compiling the application source code; and re-organising the compiled source code to incorporate optimisations to provide an optimised executable code.
16. The method of claim 9, further comprising the step of: deploying the configuration information and the executable code, and, where relevant, the decision condition information, in at least one management system which is for managing re-configuration of instruction processors implemented using re-programmable hardware.
17. The method of claim 9, further comprising the step of: deploying the configuration information and the executable code, and, where relevant, the decision condition information, in at least one library for enabling re-programming of re-programmable hardware.
18. A design system for generating configuration information and associated executable code based on a customisation specification, which includes application information including application source code and customisation information including design constraints, for implementing an instruction processor using re-programmable hardware, the system comprising: a template generator for generating a template for each processor style identified as a candidate for implementation, each template comprising a processor definition and associated parameters for implementing the respective processor style in the re-programmable hardware, and incorporating the customisation information from the customisation specification; an analyser for analysing instruction information for each template and determining instruction optimisations; a compiler for compiling the application source code to include the instruction optimisations and generate executable code; an instantiator for analysing architecture information for each template, determining architecture optimisations and generating configuration information including the architecture optimisations; and a builder for generating device-specific configuration information from the configuration information including the architecture optimisations.
19. A method of managing run-time re-configuration of an instruction processor implemented in re-programmable hardware, comprising the steps of: providing a configuration library containing configuration information for a plurality of instruction processor implementations; providing a code library for containing executable code for the implementations; loading application data and, as required, the configuration information and the executable code into re-programmable hardware for implementation and execution of an instruction processor; executing the executable code; obtaining run-time statistics relating to operation of the instruction processor; and loading the configuration information and the executable code for a new implementation into the re-programmable hardware.
20. The method of claim 19, wherein the loading step is performed automatically on a predeterminable event.
21. The method of claim 19, wherein the loading step is actuated by an external agent.
22. The method of claim 21, further comprising the step of: loading the configuration information and the executable code for a new implementation into the respective ones of the configuration library and the code library prior to the loading step; and wherein the loading step comprises the step of loading the configuration information and the executable code for that implementation into the re-programmable hardware on actuation by an external agent.
23. The method of claim 19, further comprising the steps of: providing a decision condition library for containing associated decision condition information for at least ones of the implementations; profiling the decision condition information for a plurality of other implementations for various run-time conditions of the implementation loaded in the re-programmable hardware; determining whether the decision condition information for any of the other implementations more closely fits the run-time statistics; and wherein, where the decision condition information for one of the other implementations more closely fits the run-time statistics, the loading step comprises the step of: loading the configuration information and the executable code for that implementation into the re-programmable hardware.
24. The method of claim 19, further comprising the step of: generating one or more new implementations optimised to the run-time statistics.
25. The method of claim 24, further comprising the step of: loading the configuration information and the executable code for each new implementation into respective ones of the configuration library and the code library.
26. The method of claim 24, further comprising the step of: loading the configuration information, the executable code and the decision condition information for each new implementation into respective ones of the configuration library, the code library and the decision condition library.
27. The method of claim 19, wherein the configuration information and the executable code for a new implementation are loaded into the re-programmable hardware on satisfaction of predefined criteria.
28. The method of claim 24, wherein the implementation generating step comprises the steps of: analysing instruction information based on the run-time statistics and determining instruction optimisations; compiling application source code to include the instruction optimisations and generate executable code; analysing architecture information based on the run-time statistics and determining architecture optimisations; generating first configuration information including the architecture optimisations; and generating second, device-specific configuration information from the first configuration information including the architecture optimisations.
29. The method of claim 28, wherein the implementation generating step further comprises the steps of: profiling the first configuration information and the executable code for each candidate implementation; and in response thereto selecting one or more optimal implementations based on predefined criteria.
30. The method of claim 28, wherein the implementation generating step further comprises the steps of: profiling information in a customisation specification and the run-time statistics; identifying at least one processor style as a candidate for implementation; and generating a template for each processor style identified as a candidate for implementation.
31. The method of claim 30, wherein profiling information for enabling optimisation is collected in the customisation specification profiling step.
32. The method of claim 31, wherein the instruction information analysis step comprises the steps of: utilising the profiling information in analysing the instruction information; and determining the instruction optimisations therefrom.
33. The method of claim 28, wherein the instruction information analysis step comprises the steps of: identifying candidate instruction optimisations; and determining implementation of the instruction optimisations based on estimations performed based on instantiation of the candidate instruction optimisations.
34. The method of claim 33, wherein, where the estimations provide that the re-programmable hardware cannot be programmed to implement all instructions together during run time, the instruction information analysis step comprises the step of: grouping combined ones of instructions into sets of instructions which can be implemented by re-programming of the re-programmable hardware.
35. The method of claim 28, wherein the instruction information analysis step comprises the steps of: determining a plurality of implementations for different run-time conditions, each having instruction optimisations associated with the run-time conditions; and generating decision condition information associated with each implementation, which decision condition information enables selection between the implementations depending on actual run-time conditions.
36. The method of claim 30, wherein, where the instruction optimisations cannot provide an implementation which complies with design constraints, the instruction information analysis step comprises the step of: invoking the customisation specification profiling step to re-profile the customisation specification based on analysis information provided by the instruction information analysis step.
37. The method of claim 28, wherein the architecture optimisations include pipelining.
38. The method of claim 28, wherein, where a plurality of configurations of the re-programmable hardware are required to implement the instruction processor, further comprising the steps of: optimising ones of the configurations into groups; and scheduling implementation of the grouped configurations.
39. The method of claim 30, wherein each template is generated from processor definitions and associated parameters extracted from a library containing processor definitions and associated parameters for a plurality of processor styles.
40. The method of claim 28, wherein a compiler utilised in compiling the application source code is generated in the instruction information analysis step, and the compiling step comprises the steps of: annotating the application source code with customisation information; and compiling the annotated source code to provide an optimised executable code.
41. The method of claim 28, wherein the compiling step comprises the steps of: compiling the application source code; and re-organising the compiled source code to incorporate optimisations to provide an optimised executable code.
42. A management system for managing run-time re-configuration of an instruction processor implemented using re-programmable hardware, comprising: a configuration library containing configuration information for a plurality of instruction processor implementations; a code library for containing executable code for the implementations; a loader for loading application data and, as required, the configuration information and the executable code into re-programmable hardware for implementation and execution of an instruction processor; a loader controller for signalling the loader to load application data and, as required, the configuration information and the executable code, and execute the executable code; a run-time monitor for obtaining run-time statistics relating to operation of the instruction processor; an optimisation determiner configured to receive the run-time statistics, and being operable to instruct the loader to load the configuration information and the executable code for a new implementation into the re-programmable hardware; and an optimisation instructor for invoking the optimisation determiner.
43. The method of claim 1, wherein the step of generating a template for each processor style identified as a candidate for implementation comprises the step of: generating a plurality of templates for different processor styles, including register-based and stack-based styles.